1. What distinguishes data science from conventional data analysis?
Ans:
Data science involves extracting meaningful insights from large, complex datasets using advanced tools like Python, machine learning, and statistics. Traditional data analysis focuses more on analyzing historical trends and generating basic reports, while data science goes deeper by predicting future outcomes and solving complex problems through algorithms and programming.
2. What is the difference between supervised and unsupervised learning?
Ans:
In supervised learning, the model is trained on labeled data where both inputs and outputs are known, similar to learning with a teacher’s guidance. Unsupervised learning involves working with unlabeled data, where the model tries to discover hidden patterns or groups, like identifying people with shared interests in a crowd without prior knowledge.
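The contrast can be sketched with toy data in pure Python (the heights and labels below are made up for illustration): the supervised model learns from labels, while the unsupervised one must find structure on its own.

```python
# Supervised: heights labeled "short"/"tall" -> learn a decision threshold.
labeled = [(150, "short"), (155, "short"), (180, "tall"), (185, "tall")]
shorts = [h for h, y in labeled if y == "short"]
talls = [h for h, y in labeled if y == "tall"]
threshold = (max(shorts) + min(talls)) / 2  # midpoint between the classes

def predict(height):
    return "tall" if height > threshold else "short"

# Unsupervised: the same heights with no labels -> split at the largest gap.
unlabeled = sorted([150, 155, 180, 185])
gaps = [unlabeled[i + 1] - unlabeled[i] for i in range(len(unlabeled) - 1)]
split = gaps.index(max(gaps))
cluster_a, cluster_b = unlabeled[:split + 1], unlabeled[split + 1:]

print(predict(170))          # uses the learned threshold (167.5) -> "tall"
print(cluster_a, cluster_b)  # groups discovered without any labels
```

The supervised half could only learn the threshold because the answers were given; the unsupervised half had to infer the grouping from the data alone.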
3. What is overfitting in machine learning, and how can you prevent it?
Ans:
A model is said to be overfit when it learns the training data too well, including its noise and errors, which makes it perform poorly on new data. Overfitting can be prevented with strategies such as cross-validation, reducing model complexity, or applying regularization.
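Of the prevention strategies mentioned, cross-validation is easy to sketch. The following is a minimal, illustrative k-fold splitter in pure Python (not a library implementation): each sample is held out exactly once, so the model is always scored on data it did not train on.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    # Distribute any remainder across the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))   # 5 folds
print(folds[0][1])  # first held-out fold: [0, 1]
```

Averaging the model's score across all k held-out folds gives a more honest estimate of generalization than a single train/test split.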
4. Explain the bias-variance tradeoff.
Ans:
Bias is error from making overly simple assumptions; variance is error from being too sensitive to small changes in the training data. A good model balances both: too much bias leads to underfitting, while too much variance leads to overfitting.
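The tradeoff can be seen numerically with two deliberately extreme models on assumed noisy samples of a linear trend: one that predicts the training mean everywhere (high bias) and one that memorizes every training point (high variance).

```python
import statistics

# Made-up noisy samples of an underlying linear trend.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
test = [(1.5, 3.0), (2.5, 5.0), (3.5, 7.0)]

mean_y = statistics.mean(y for _, y in train)

def high_bias(x):
    return mean_y  # ignores x entirely: underfits

memorized = dict(train)
def high_variance(x):
    nearest = min(memorized, key=lambda xi: abs(xi - x))
    return memorized[nearest]  # echoes the closest training point: overfits

def mse(model, data):
    return statistics.mean((model(x) - y) ** 2 for x, y in data)

print(mse(high_bias, train))      # high error even on the training data
print(mse(high_variance, train))  # 0.0 -- memorized the training set perfectly
print(mse(high_variance, test))   # nonzero once the data is unseen
```

The overfit model looks flawless on training data but its error appears only on unseen points, while the underfit model is mediocre everywhere; a balanced model sits between the two extremes.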
5. In terms of data science, what are the main distinctions between R and Python?
Ans:
Python is great for building data applications and machine learning models. It’s more versatile and widely used in the industry. R is better for statistical analysis and data visualization, especially in academic or research settings.
6. How do you handle missing data in a dataset?
Ans:
Missing data can be handled by deleting rows with missing values, filling gaps with averages or most frequent values, or using advanced methods like interpolation or predictive models to estimate missing entries.
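The first two strategies can be shown in a few lines of pure Python (the age values are made up; `None` marks a missing entry):

```python
ages = [25, None, 31, 40, None, 28]

# Strategy 1: drop missing entries entirely.
dropped = [a for a in ages if a is not None]

# Strategy 2: fill missing entries with the mean of the observed values.
mean_age = sum(dropped) / len(dropped)
imputed = [a if a is not None else mean_age for a in ages]

print(dropped)  # [25, 31, 40, 28]
print(imputed)  # [25, 31.0, 31, 40, 31.0, 28]
```

Deletion loses rows (and any other information they carried), while imputation keeps every row at the cost of injecting an estimate; which is appropriate depends on how much data is missing and why.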
7. What does feature engineering involve?
Ans:
Feature engineering is the process of creating new variables from existing data to enhance the model’s learning. It includes cleaning, transforming, and combining features to improve prediction accuracy.
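A small sketch with made-up housing records shows the idea: deriving a ratio, a transformed value, and a binary flag that a model can often learn from more easily than the raw columns.

```python
houses = [
    {"price": 300_000, "sqft": 1500, "year_built": 1995},
    {"price": 450_000, "sqft": 2000, "year_built": 2015},
]

CURRENT_YEAR = 2024  # assumed reference year for the age feature

for h in houses:
    h["price_per_sqft"] = h["price"] / h["sqft"]  # ratio of two raw features
    h["age"] = CURRENT_YEAR - h["year_built"]     # transformed feature
    h["is_new"] = h["age"] < 10                   # binary flag

print(houses[0]["price_per_sqft"])  # 200.0
print(houses[1]["is_new"])          # True
```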
8. What is the difference between a classification and a regression problem?
Ans:
In classification, you predict categories like “spam” or “not spam.” In regression, you predict continuous values like a house price or temperature. Both are types of supervised learning but solve different problems.
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing actual values with predicted ones. It breaks results into categories like true positives, false positives, true negatives, and false negatives.
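The four cells can be counted directly from toy binary labels (the lists below are illustrative):

```python
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(f"TP={tp} FP={fp}")  # TP=3 FP=1
print(f"FN={fn} TN={tn}")  # FN=1 TN=3
```

Laid out as a 2x2 grid, these counts are the confusion matrix, and most classification metrics (accuracy, precision, recall) are simple ratios of them.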
10. What are precision and recall?
Ans:
Precision tells you how many of the predicted positives are actually correct. Recall tells you how many of the actual positives were correctly identified. Together they measure how well a model finds the right results.
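Both metrics fall out of the confusion-matrix cells. A minimal sketch with assumed counts:

```python
# Assumed confusion-matrix counts for a binary classifier.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(precision)  # 0.75
print(recall)     # 0.75
```

There is usually a tradeoff between the two: predicting "positive" more eagerly raises recall but tends to lower precision, which is why they are often reported together (or combined into an F1 score).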