1. How is data science different from traditional data analysis?
Ans:
Data science extracts actionable insights from large, complex datasets using tools such as Python, machine learning, and statistical methods. Traditional data analysis mainly examines past trends and produces descriptive reports; data science goes further, forecasting future outcomes and solving intricate problems through algorithms and programming.
2. What distinguishes supervised learning from unsupervised learning?
Ans:
Supervised learning involves training a model on labeled data where both input and output are known, much like learning under direct supervision. Unsupervised learning works with unlabeled data, allowing the model to detect hidden patterns or clusters, similar to identifying groups of people with common interests without prior information.
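A minimal sketch of the two paradigms, assuming scikit-learn and its bundled Iris dataset are available:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the known labels y guide the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised: only X is given; the model discovers clusters on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Discovered clusters:", km.labels_[:5])
```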
3. What is overfitting in machine learning, and how can you prevent it?
Ans:
Overfitting occurs when a model learns the training data too precisely, including noise and errors, leading to poor generalization on new data. Common ways to avoid overfitting include using cross-validation, simplifying the model, and applying regularization techniques.
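A small illustration of overfitting and one fix, using scikit-learn decision trees on synthetic data (the dataset and depth limit here are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set (overfits).
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting depth is one way to simplify the model so it generalizes better.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, m in [("deep tree", deep), ("shallow tree", shallow)]:
    print(name, "train:", round(m.score(X_tr, y_tr), 3),
          "test:", round(m.score(X_te, y_te), 3))
```

A large gap between training and test accuracy is the classic symptom of overfitting.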
4. Explain the bias-variance tradeoff.
Ans:
Bias is error from making overly simple or wrong assumptions; variance is error from being too sensitive to small changes in the data. A good model balances both: too much bias leads to underfitting, while too much variance leads to overfitting.
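A hedged sketch of the tradeoff, fitting polynomials of increasing degree to noisy data with scikit-learn (the degrees chosen are just illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Degree 1 has high bias (underfits); degree 15 has high variance (overfits).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print("degree", degree, "test R^2:", round(model.score(X_te, y_te), 3))
```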
5. What are the key differences between Python and R in data science?
Ans:
Python is versatile and widely used for building data-driven applications and machine learning models, favored by many industries. R is specialized in statistical analysis and data visualization, making it a preferred choice in academia and research.
6. How do you handle missing data in a dataset?
Ans:
Missing data can be addressed by deleting records with missing values, filling in missing spots with averages or the most common values, or using sophisticated techniques like interpolation or predictive modeling to estimate the missing information.
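A brief pandas sketch of the first two strategies (the toy DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 40], "city": ["NY", "LA", None, "NY"]})

# Option 1: drop rows that contain any missing value.
print(df.dropna())

# Option 2: fill numeric gaps with the mean, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)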
7. Explain the concept of feature engineering.
Ans:
Feature engineering means creating new input variables (features) from existing data to help the model learn better. It involves cleaning data, transforming values, and combining features to improve prediction accuracy.
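For instance, a minimal pandas sketch, with hypothetical columns invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-05", "2023-06-20"]),
    "total_spent": [250.0, 90.0],
    "num_orders": [5, 3],
})

# Derived features often carry more signal than the raw columns.
df["signup_month"] = df["signup_date"].dt.month
df["avg_order_value"] = df["total_spent"] / df["num_orders"]
print(df)
```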
8. What is the difference between a classification and a regression problem?
Ans:
In classification, you predict categories like “spam” or “not spam.” In regression, you predict continuous values like a house price or temperature. Both are types of supervised learning but solve different problems.
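A side-by-side sketch using scikit-learn's synthetic data helpers:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: discrete target (e.g. spam / not spam).
Xc, yc = make_classification(n_samples=100, random_state=0)
print(LogisticRegression(max_iter=1000).fit(Xc, yc).predict(Xc[:3]))  # class labels

# Regression: continuous target (e.g. a price).
Xr, yr = make_regression(n_samples=100, random_state=0)
print(LinearRegression().fit(Xr, yr).predict(Xr[:3]))  # real-valued outputs
```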
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing actual values with predicted ones. It breaks results into categories like true positives, false positives, true negatives, and false negatives.
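A quick demonstration with scikit-learn, using made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]] for binary labels ordered 0, 1.
print(confusion_matrix(y_true, y_pred))
```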
10. What are precision and recall?
Ans:
Precision tells you how many of the predicted positives are actually correct. Recall tells you how many of the actual positives were correctly identified. Together they measure how well a model finds the right results.
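The standard formulas, computed here with scikit-learn on the same toy labels as above:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```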
11. What is cross-validation, and why is it important?
Ans:
Cross-validation tests a model’s accuracy by dividing data into parts, training on some, and testing on others. It helps make sure the model performs well on unseen data and avoids overfitting.
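A short scikit-learn sketch of 5-fold cross-validation (the classifier choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on four folds, test on the fifth, rotating five times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores.round(3), "mean:", round(scores.mean(), 3))
```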
12. What is regularization used for in machine learning?
Ans:
Regularization adds a penalty for complexity in a model’s loss function to discourage overfitting. It promotes simpler models that generalize better to new data.
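A brief comparison sketch, assuming scikit-learn; the alpha values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=50, n_features=30, noise=10, random_state=0)

# Ridge (L2) shrinks coefficients; Lasso (L1) can zero some out entirely.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, "max |coef|:", round(abs(model.coef_).max(), 2))
```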
13. What is a decision tree, and how does it work?
Ans:
A decision tree is a predictive model that splits data based on feature values through a series of decision rules arranged like a flowchart, ultimately leading to classification or regression outputs at its leaves.
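A minimal scikit-learn example that prints the learned flowchart of feature-threshold splits as text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each branch tests one feature against a threshold; leaves hold the outputs.
print(export_text(tree, feature_names=load_iris().feature_names))
```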
14. What are the differences between bagging and boosting?
Ans:
Bagging builds multiple models independently and combines them to improve accuracy. Boosting builds models one after another, each one learning from the mistakes of the previous one to improve performance.
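A hedged sketch comparing the two with scikit-learn's stock implementations (the hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Bagging trains models independently in parallel on bootstrap samples;
# boosting trains them sequentially, each correcting its predecessor.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```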
15. What is dimensionality reduction, and why is it important?
Ans:
Dimensionality reduction involves reducing the number of input variables while retaining essential information, which speeds up training, reduces noise, and improves model performance, especially in high-dimensional datasets.
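A short PCA sketch with scikit-learn's digits dataset (the 95% variance threshold is an illustrative choice):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

# A float n_components keeps just enough components to explain
# that fraction of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```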