1. How does data science differ from traditional data analysis?
Ans:
Data science involves using advanced tools and techniques like Python, machine learning, and statistical modeling to extract actionable insights from large and complex datasets. Traditional data analysis, on the other hand, focuses more on summarizing historical data and generating basic reports. While traditional analysis explains past trends, data science goes further by predicting future outcomes and solving complex problems through algorithms and programming.
2. What distinguishes supervised learning from unsupervised learning?
Ans:
Supervised learning involves training a model on labeled data where both input and output are known, much like learning under direct supervision. Unsupervised learning works with unlabeled data, allowing the model to detect hidden patterns or clusters, similar to identifying groups of people with common interests without prior information.
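To make the contrast concrete, here is a minimal scikit-learn sketch; the synthetic data and model choices are illustrative assumptions, not part of any standard answer:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=42)

# Supervised: the labels y are given, so the model learns a mapping X -> y.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: only X is given; the model must discover structure itself.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(clf.predict(X[:5]))   # predicted labels
print(km.labels_[:5])       # discovered cluster assignments
```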
3. What is overfitting in machine learning, and how can you prevent it?
Ans:
Overfitting occurs when a model learns the training data too precisely, including noise and errors, leading to poor generalization on new data. Common ways to avoid overfitting include using cross-validation, simplifying the model, and applying regularization techniques.
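As a rough sketch, assuming scikit-learn and a synthetic dataset, regularization and cross-validation might be combined like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# Ridge adds an L2 penalty that discourages extreme coefficients,
# and 5-fold cross-validation checks performance on held-out folds.
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```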
4. Explain the bias-variance tradeoff.
Ans:
Bias refers to errors caused by oversimplified assumptions in a model, while variance refers to errors from the model being overly sensitive to the training data. A well-balanced model minimizes both, ensuring good performance on training and test data alike: too much bias leads to underfitting, while too much variance leads to overfitting.
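One way to see the tradeoff is to vary model complexity and watch cross-validated performance. This sketch uses polynomial degree as the complexity knob; the data and degree values are invented for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

# Degree 1 underfits (high bias); degree 15 overfits (high variance);
# a moderate degree balances the two.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(degree, round(score, 3))
```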
5. How do Python and R compare in data science applications?
Ans:
Python is a general-purpose programming language widely adopted in industry for building machine learning models and data-driven applications. R is more focused on statistical computing and data visualization, making it a preferred tool in academic and research settings.
6. How do you handle missing data in a dataset?
Ans:
Missing data can be handled in several ways: by removing affected rows, filling in gaps using statistical measures like mean or mode, or applying advanced techniques such as interpolation or predictive modeling to estimate missing values based on the rest of the dataset.
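A short pandas sketch of these three approaches; the column names and values are hypothetical:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 29],
                   "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()                               # remove affected rows
filled = df.fillna({"age": df["age"].mean(),        # mean for numeric
                    "city": df["city"].mode()[0]})  # mode for categorical
interpolated = df.assign(age=df["age"].interpolate())  # estimate from neighbors
print(filled)
```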
7. Explain the concept of feature engineering.
Ans:
Feature engineering is the process of creating new input variables (features) from existing data to help the model learn better. It involves cleaning data, transforming values, and combining features to improve prediction accuracy.
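For instance, a small pandas sketch deriving two new features; the column names here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"signup_date": pd.to_datetime(["2024-01-05", "2024-03-20"]),
                   "total_spent": [120.0, 45.0],
                   "num_orders": [4, 1]})

# Derive new features from existing columns.
df["signup_month"] = df["signup_date"].dt.month               # date decomposition
df["avg_order_value"] = df["total_spent"] / df["num_orders"]  # ratio feature
print(df)
```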
8. What is the difference between a classification and a regression problem?
Ans:
In classification, you predict categories like “spam” or “not spam.” In regression, you predict continuous values like a house price or temperature. Both are types of supervised learning but solve different problems.
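A minimal scikit-learn sketch of the two problem types, using decision trees on synthetic data purely for illustration:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: discrete labels (e.g. spam vs. not spam).
Xc, yc = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:3]))   # class labels such as [0 1 0]

# Regression: continuous targets (e.g. a price).
Xr, yr = make_regression(n_samples=100, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))   # real-valued predictions
```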
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing actual values with predicted ones. It breaks results into four categories: true positives, false positives, true negatives, and false negatives.
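A quick sketch with scikit-learn; the labels are made up for the example:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels in {0, 1}.
print(confusion_matrix(y_true, y_pred))
```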
10. What are precision and recall?
Ans:
Precision is the ratio of correct positive predictions to the total number of positive predictions made by the model. Recall, also known as sensitivity, measures the proportion of actual positive cases that were successfully identified by the model.
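A small sketch with invented labels, showing the two formulas in action:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```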
11. What is cross-validation, and why is it important?
Ans:
Cross-validation estimates a model’s accuracy by splitting the data into several folds, training on some folds and testing on the held-out fold, then rotating so every fold is used for testing once. It helps ensure the model performs well on unseen data and guards against overfitting.
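A minimal example with scikit-learn’s cross_val_score; the model and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, test on the held-out fold, rotate 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```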
12. What is regularization used for in machine learning?
Ans:
Regularization helps prevent overfitting by adding a penalty to the model’s loss function, discouraging it from becoming too complex. It promotes simpler, more generalizable models, with common techniques including L1 (Lasso) and L2 (Ridge) regularization.
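A brief sketch contrasting L1 and L2 on synthetic data; the alpha values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha controls penalty strength; larger alpha means a simpler model.
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can shrink coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward 0
print(sum(lasso.coef_ == 0), "coefficients zeroed by L1")
```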
13. What is a decision tree, and how does it work?
Ans:
A decision tree is a predictive model that splits data based on feature values through a series of decision rules arranged like a flowchart, ultimately leading to classification or regression outputs at its leaves.
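A short sketch that trains a shallow tree and prints its decision rules; the dataset and depth are chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned decision rules (the flowchart structure).
print(export_text(tree))
```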
14. What are the differences between bagging and boosting?
Ans:
Bagging (Bootstrap Aggregating) builds multiple models in parallel on random subsets of the data and averages their outputs, which improves accuracy mainly by reducing variance. Boosting, on the other hand, trains models sequentially, with each new model focusing on correcting the errors made by the previous one, which primarily reduces bias.
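A compact comparison using scikit-learn’s BaggingClassifier and AdaBoostClassifier on synthetic data; the estimator counts are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: independent models on bootstrap samples, combined by voting.
bag = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: sequential models, each reweighting the previous one's mistakes.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```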
15. What is dimensionality reduction, and why is it important?
Ans:
Dimensionality reduction involves reducing the number of input variables while retaining essential information, which speeds up training, reduces noise, and improves model performance, especially in high-dimensional datasets.
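As one common example, PCA can project the 64-pixel digits data onto 10 components; the component count here is an illustrative choice:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel features per image

# Project onto the 10 directions of greatest variance.
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```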