1. What distinguishes data science from conventional data analysis?
Ans:
Data Science is the process of extracting useful insights from large and complex sets of data using modern tools like Python, machine learning, and statistics. Traditional data analysis focuses more on past trends and basic reports, while Data Science goes deeper, predicting outcomes and solving complex problems with algorithms and programming.
2. How does supervised learning differ from unsupervised learning?
Ans:
In supervised learning, the model uses labeled data where both the input and the correct output are given; it's like learning with a teacher. In unsupervised learning, the model works with unlabeled data to find patterns or groupings on its own, like spotting groups of friends with common interests in a crowd without knowing anyone. The sketch below contrasts the two.
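A minimal sketch using scikit-learn, assuming it is installed; the toy data here is made up purely for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1, 2], [2, 1], [8, 9], [9, 8]]   # inputs
y = [0, 0, 1, 1]                       # labels (the "teacher")

# Supervised: learns the mapping from X to the given labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5, 1.5]]))       # predicts a label for a new input

# Unsupervised: no labels; the model groups similar points on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                      # cluster assignments it discovered
```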
3. What is overfitting in machine learning, and how can you prevent it?
Ans:
Overfitting happens when a model learns the training data too well, including its noise and errors, which makes it perform poorly on new data. It can be avoided with strategies such as cross-validation, reducing model complexity, or applying regularization.
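Here is a small sketch of one prevention strategy (regularization) using scikit-learn; the data is randomly generated and the penalty strength `alpha` is illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))               # few samples, many features
y = X[:, 0] + rng.normal(scale=0.1, size=20)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # the penalty shrinks coefficients

# The regularized model keeps coefficients smaller, resisting the noise.
print(np.abs(plain.coef_).max(), np.abs(ridge.coef_).max())
```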
4. Explain the bias-variance tradeoff.
Ans:
Bias is error from making wrong assumptions; variance is error from being too sensitive to small changes in the data. A good model balances both: too much bias leads to underfitting, while too much variance leads to overfitting.
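A hypothetical illustration, assuming scikit-learn: fitting polynomials of different degrees to noisy data. Degree 1 underfits (high bias), degree 15 chases the noise (high variance), and a middle degree balances the two:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, model.score(X, y))   # training fit rises with degree,
                                       # but high degrees won't generalize
```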
5. In terms of data science, what are the main distinctions between R and Python?
Ans:
Python is great for building data applications and machine learning models. It’s more versatile and widely used in the industry. R is better for statistical analysis and data visualization, especially in academic or research settings.
6. How do you handle missing data in a dataset?
Ans:
You can handle missing data by removing rows with missing values, filling them with averages or most frequent values, or using advanced techniques like interpolation or model-based prediction to estimate missing values.
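A short sketch of the common options, assuming pandas; the column names and values are made up for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 32, 40],
                   "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()                                # remove rows with gaps
filled = df.fillna({"age": df["age"].mean(),         # fill with the average
                    "city": df["city"].mode()[0]})   # or the most frequent value
interp = df["age"].interpolate()                     # estimate from neighbors
print(filled)
```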
7. Explain the concept of feature engineering.
Ans:
Feature engineering means creating new input variables (features) from existing data to help the model learn better. It involves cleaning data, transforming values, and combining features to improve prediction accuracy.
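A minimal sketch, assuming pandas; the dataset and the derived features are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"price": [250000, 180000], "sqft": [2000, 1200],
                   "date_sold": pd.to_datetime(["2023-06-01", "2023-12-15"])})

# New features built from existing columns:
df["price_per_sqft"] = df["price"] / df["sqft"]   # combining two features
df["sale_month"] = df["date_sold"].dt.month       # extracting a useful component
print(df)
```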
8. What is the difference between a classification and a regression problem?
Ans:
In classification, you predict categories like “spam” or “not spam.” In regression, you predict continuous values like a house price or temperature. Both are types of supervised learning but solve different problems.
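A side-by-side sketch, assuming scikit-learn; the toy data is illustrative:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4]]

clf = DecisionTreeClassifier().fit(X, ["spam", "spam", "ham", "ham"])
print(clf.predict([[1.5]]))    # classification: predicts a category

reg = DecisionTreeRegressor().fit(X, [100.0, 150.0, 200.0, 250.0])
print(reg.predict([[1.5]]))    # regression: predicts a continuous value
```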
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing actual values with predicted ones. It breaks results into categories like true positives, false positives, true negatives, and false negatives.
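A minimal sketch, assuming scikit-learn; the labels here are made up:

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(actual, predicted))
```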
10. What are precision and recall?
Ans:
Precision tells you how many of the predicted positives are actually correct. Recall tells you how many of the actual positives were correctly predicted. Together they measure how well a model finds the right results.
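A short sketch, assuming scikit-learn and reusing the same toy labels as above:

```python
from sklearn.metrics import precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(actual, predicted))  # TP / (TP + FP) = 3/4
print(recall_score(actual, predicted))     # TP / (TP + FN) = 3/4
```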
11. What is cross-validation, and why is it important?
Ans:
Cross-validation tests a model’s accuracy by dividing data into parts, training on some, and testing on others. It helps make sure the model performs well on unseen data and avoids overfitting.
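A minimal sketch of 5-fold cross-validation, assuming scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # a more reliable estimate than a single split
```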
12. What is regularization used for in machine learning?
Ans:
Regularization reduces overfitting by adding a penalty to complex models. It helps keep the model simple and improves performance on new, unseen data.
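A small sketch, assuming scikit-learn; the penalty strength `alpha` is illustrative. An L1 penalty (Lasso) can push unhelpful coefficients to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=50)   # only feature 0 matters

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most coefficients are pushed to exactly zero
```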
13. What is a decision tree, and how does it work?
Ans:
A decision tree is a model that splits data into branches based on conditions in order to make a decision. It works like a flowchart: each question leads to a new split until a final decision or prediction is made.
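A minimal sketch, assuming scikit-learn; the tiny weather-style dataset is made up for illustration, and `export_text` prints the flowchart-like rules the tree learned:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[30, 0], [25, 1], [10, 1], [5, 0]]   # [temperature, is_raining]
y = ["go out", "stay in", "stay in", "go out"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["temperature", "is_raining"]))
```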
14. What are the differences between bagging and boosting?
Ans:
Bagging builds multiple models independently and combines them to improve accuracy. Boosting builds models one after another, each one learning from the mistakes of the previous one to improve performance.
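A side-by-side sketch, assuming scikit-learn and its bundled breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=50)            # independent models, combined
boosting = GradientBoostingClassifier(n_estimators=50)  # sequential, error-correcting

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```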
15. What is dimensionality reduction, and why is it important?
Ans:
Dimensionality reduction means reducing the number of input features while keeping important information. It makes models faster and easier to train, especially when working with high-dimensional data.
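A minimal sketch using PCA, one common technique, assuming scikit-learn and its bundled digits dataset (64 features per image):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=10).fit(X)        # compress 64 features into 10 components
X_small = pca.transform(X)

print(X.shape, "->", X_small.shape)
print(pca.explained_variance_ratio_.sum())  # share of information retained
```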