1. What distinguishes data science from conventional data analysis?
Ans:
Data science involves extracting meaningful insights from large, complex datasets using programming (commonly Python), machine learning, and statistics. Traditional data analysis focuses more on examining historical trends and generating reports, while data science goes further by predicting future outcomes and solving complex problems through algorithms and programming.
2. What is the difference between supervised and unsupervised learning?
Ans:
In supervised learning, the model is trained on labeled data where both inputs and outputs are known, similar to learning with a teacher’s guidance. Unsupervised learning involves working with unlabeled data, where the model tries to discover hidden patterns or groups, like identifying people with shared interests in a crowd without prior knowledge.
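A minimal sketch of the contrast, assuming scikit-learn is installed: a classifier is trained on the labeled iris data, while k-means must group the same measurements without ever seeing the labels.

# Supervised vs. unsupervised on the same data (scikit-learn assumed installed).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the model sees both the inputs X and the labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: the model sees only X and must find groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])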
3. What is overfitting in machine learning, and how can you prevent it?
Ans:
When a model learns the training data too well, including its noise and errors, it is said to be overfit, and it performs poorly on new data. Overfitting can be avoided with strategies such as cross-validation, reducing model complexity, or applying regularization.
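One quick way to spot overfitting, sketched here with scikit-learn on synthetic data: an unconstrained decision tree nearly memorizes the training set, and limiting its depth (one way of reducing model complexity) narrows the train/test gap.

# Spotting overfitting: compare train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)             # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# A large gap between train and test scores signals overfitting;
# constraining the tree's depth typically narrows it.
print("deep:   ", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))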
4. Explain the bias-variance tradeoff.
Ans:
Bias is error from making overly simple or wrong assumptions; variance is error from being too sensitive to small changes in the training data. A good model balances the two: too much bias leads to underfitting, while too much variance leads to overfitting.
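A small illustration using only NumPy (the degrees, sample size, and noise level are arbitrary choices for the sketch): a degree-1 polynomial is too rigid to follow a sine curve, while a degree-14 polynomial threads through every noisy point and typically swings wildly between them.

# Bias vs. variance: polynomial fits of increasing degree on noisy data.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy samples
x_val = np.linspace(0, 1, 200)
y_val = np.sin(2 * np.pi * x_val)                               # the true signal

for degree in (1, 4, 14):
    fit = Polynomial.fit(x, y, deg=degree)
    mse = np.mean((fit(x_val) - y_val) ** 2)
    # degree 1 tends to underfit (high bias), degree 14 to overfit (high
    # variance), with a moderate degree usually landing in between.
    print(f"degree {degree:2d}: validation MSE = {mse:.3f}")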
5. In terms of data science, what are the main distinctions between R and Python?
Ans:
Python is great for building data applications and machine learning models. It’s more versatile and widely used in the industry. R is better for statistical analysis and data visualization, especially in academic or research settings.
6. How do you handle missing data in a dataset?
Ans:
Missing data can be handled by deleting rows with missing values, filling gaps with averages or most frequent values, or using advanced methods like interpolation or predictive models to estimate missing entries.
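A short pandas sketch of all three approaches on a toy table (the column names are made up for illustration):

# Three common ways to handle missing values (pandas assumed installed).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()                                    # 1. delete rows with gaps
filled = df.fillna({"age": df["age"].mean(),             # 2. fill with the average ...
                    "city": df["city"].mode()[0]})       #    ... or most frequent value
interpolated = df.assign(age=df["age"].interpolate())    # 3. estimate from neighbors
print(filled)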
7. What does feature engineering involve?
Ans:
Feature engineering is the process of creating new variables from existing data to enhance the model’s learning. It includes cleaning, transforming, and combining features to improve prediction accuracy.
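A small pandas sketch (the columns are hypothetical): deriving a ratio from two raw features and extracting a component from a date.

# Feature engineering: creating new variables from existing ones.
import pandas as pd

df = pd.DataFrame({"price": [250000, 410000], "sqft": [1000, 1640],
                   "sold": pd.to_datetime(["2023-01-15", "2023-07-02"])})

df["price_per_sqft"] = df["price"] / df["sqft"]  # combine two features into a ratio
df["sale_month"] = df["sold"].dt.month           # extract a date component
print(df)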
8. What is the difference between a classification and a regression problem?
Ans:
In classification, you predict categories like “spam” or “not spam.” In regression, you predict continuous values like a house price or temperature. Both are types of supervised learning but solve different problems.
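A side-by-side sketch with scikit-learn on tiny made-up data: the classifier returns a category, the regressor returns a number.

# Classification vs. regression on the same inputs.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Classification: the target is a category ("spam" / "ham").
clf = LogisticRegression().fit(X, ["spam", "spam", "ham", "ham"])
print(clf.predict([[1.5]]))   # -> a label

# Regression: the target is a continuous value (e.g., a price).
reg = LinearRegression().fit(X, [100.0, 150.0, 210.0, 240.0])
print(reg.predict([[1.5]]))   # -> a number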
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing actual values with predicted ones. It breaks results into categories like true positives, false positives, true negatives, and false negatives.
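A minimal example with scikit-learn's confusion_matrix on hand-written toy labels:

# Building a confusion matrix from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# For binary labels, rows are actual classes and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))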
10. What are precision and recall?
Ans:
Precision tells you how many of the predicted positives are actually correct. Recall tells you how many of the actual positives were correctly identified. Together they measure how well a model finds the right results.
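Continuing the toy labels from the previous example, scikit-learn computes both metrics directly:

# Precision and recall on the same labels as the confusion matrix above.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4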
11. What is cross-validation, and why is it important?
Ans:
Cross-validation tests a model’s accuracy by dividing data into parts, training on some, and testing on others. It helps make sure the model performs well on unseen data and avoids overfitting.
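A one-call sketch with scikit-learn: 5-fold cross-validation on the iris data, where each fold takes a turn as the held-out test set.

# 5-fold cross-validation in one call.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Each of the 5 folds is held out once while the other 4 train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())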
12. What is regularization used for in machine learning?
Ans:
Regularization reduces overfitting by adding a penalty to complex models. It helps keep the model simple and improves performance on new, unseen data.
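A small sketch, assuming scikit-learn: ridge regression adds an L2 penalty, which visibly shrinks the learned coefficients compared with plain least squares (the alpha value here is an arbitrary choice).

# Ridge (L2) regularization shrinks coefficients toward zero.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=0)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength

print("unregularized coef magnitude:", abs(plain.coef_).mean())
print("ridge coef magnitude:        ", abs(ridge.coef_).mean())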
13. What is a decision tree, and how does it work?
Ans:
A decision tree is a model that splits data into branches based on conditions. It works like a flowchart: each question leads to a new split until a final decision or prediction is made.
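A short scikit-learn sketch that prints the learned flowchart itself, so the if/else structure is visible:

# Train a tiny tree and print its splits as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text shows each condition and the branch it leads to.
print(export_text(tree, feature_names=data.feature_names))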
14. What are the differences between bagging and boosting?
Ans:
Bagging builds multiple models independently and combines them to improve accuracy. Boosting builds models one after another, each one learning from the mistakes of the previous one to improve performance.
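A compact comparison using scikit-learn's stock implementations of each idea: a random forest (bagging) and gradient boosting (boosting), evaluated the same way on synthetic data.

# Bagging (independent trees, averaged) vs. boosting (sequential trees).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)  # trees built independently
boosting = GradientBoostingClassifier(random_state=0)               # each tree fixes prior errors

print("bagging: ", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())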
15. What is dimensionality reduction, and why is it important?
Ans:
Dimensionality reduction means reducing the number of input features while keeping important information. It makes models faster and easier to train, especially when working with high-dimensional data.
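A minimal PCA sketch with scikit-learn, compressing the 64-pixel digits data down to 10 components (the component count is an arbitrary choice):

# PCA: keep the directions that explain the most variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=10).fit(X)
X_small = pca.transform(X)

print(X.shape, "->", X_small.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())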