1. What is Data Science, and how is it different from traditional data analysis?
Ans:
Data Science is the practice of gathering, cleaning, analyzing, and applying data to make predictions or guide decisions. It combines machine learning, big data technologies, and data visualization. Unlike traditional data analysis, which primarily examines past trends, Data Science also builds predictive models to anticipate future events and outcomes.
2. What differentiates supervised learning from unsupervised learning?
Ans:
In supervised learning, the dataset contains labeled examples, and the model learns to associate inputs with the correct outputs. In unsupervised learning, the dataset is unlabeled, and the model uncovers hidden patterns, clusters, or relationships without prior labels.
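A minimal sketch of the contrast, assuming scikit-learn is available; the toy data is invented purely for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[0.0], [0.1], [0.9], [1.0]]   # inputs
y = [0, 0, 1, 1]                   # labels (only the supervised model sees these)

# Supervised: learn a mapping from inputs X to the known labels y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: find structure in X alone -- no labels provided.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(clf.predict([[0.05], [0.95]]))  # predicted labels for new inputs
print(km.labels_)                     # discovered cluster assignments
```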
3. What is overfitting and how can we stop it?
Ans:
Overfitting happens when a model learns the training data too closely, including its noise, and as a result performs poorly on new data. You can prevent it by using simpler models, cross-validation, or regularization techniques.
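A small illustration, assuming NumPy: the underlying signal here is a straight line, but a high-degree polynomial has enough capacity to chase the noise in the training points, driving training error down without capturing anything real.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.1, size=10)  # linear signal + noise

simple = np.polyfit(x_train, y_train, deg=1)     # matches the true model
complex_ = np.polyfit(x_train, y_train, deg=9)   # enough capacity to memorize noise

def mse(coefs):
    return float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))

# The degree-9 fit has lower *training* error, but it has fit the noise,
# which is exactly what cross-validation on held-out data would expose.
print(mse(simple), mse(complex_))
```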
4. What is the bias-variance tradeoff?
Ans:
Bias is the error from wrong assumptions in the model, and variance is the error from too much sensitivity to the training data. A good model finds a balance between bias and variance to perform well on both training and test data.
5. How are Python and R different for Data Science?
Ans:
Python is a versatile language widely used for building machine learning models, handling large-scale data, and integrating with production systems. R excels in statistical analysis, in-depth data exploration, and quick visualizations. Python is more general-purpose, while R is more specialized for statistical tasks.
6. How do we deal with missing data?
Ans:
You can handle missing data by removing the rows, filling in missing values with the mean or median, or using algorithms that can handle missing data. The method depends on how much and what type of data is missing.
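A minimal sketch with pandas; the column names and values are invented for illustration, showing the dropping and imputation options side by side:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "income": [50, 60, None, 70]})

dropped = df.dropna()             # option 1: remove rows with missing values
imputed = df.fillna(df.median())  # option 2: fill with each column's median

print(len(dropped))               # only the fully complete rows survive
print(imputed["age"].tolist())    # the gap is filled with the median age
```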
7. What does feature engineering mean?
Ans:
Feature engineering is the process of creating new useful input features or modifying existing ones to improve model performance. It helps the model understand the data better.
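A short sketch, assuming pandas; the columns are invented for illustration. A derived ratio can be more informative to a model than either raw column alone:

```python
import pandas as pd

df = pd.DataFrame({"total_spent": [100.0, 250.0],
                   "num_orders": [4, 5]})

# New engineered feature: average spend per order.
df["avg_order_value"] = df["total_spent"] / df["num_orders"]

print(df["avg_order_value"].tolist())  # [25.0, 50.0]
```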
8. How is classification different from regression?
Ans:
Classification predicts categories like "yes or no" or "spam or not spam." Regression predicts continuous values like house prices or temperatures.
9. What is a confusion matrix in classification?
Ans:
A confusion matrix shows how well a classification model performs by comparing predicted results with actual results. It includes values like true positives, false positives, true negatives, and false negatives.
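A from-scratch sketch of the four cells for a binary classifier; the label lists are invented for illustration:

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each cell by comparing actual and predicted labels pairwise.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(tp, fp, fn, tn)  # 3 1 1 3
```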
10. What do precision and recall mean?
Ans:
Precision is the proportion of predicted positive outcomes that were actually positive. Recall is the proportion of real positive cases that the model detected.
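The two definitions can be computed directly from confusion-matrix counts; the counts below are invented for illustration:

```python
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall    = tp / (tp + fn)  # of the actual positives, how many were found

print(precision, recall)  # 0.75 0.75
```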
11. Why is cross-validation used?
Ans:
Cross-validation helps check how well a model works on different parts of the data. It prevents overfitting and gives a better idea of the model's true performance.
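A hedged sketch using scikit-learn's built-in iris dataset: 5-fold cross-validation trains on four folds and scores on the fifth, rotating through all five, so every row is used for evaluation exactly once.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(len(scores))    # one accuracy score per fold
print(scores.mean())  # averaged estimate of generalization performance
```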
12. Why do we use regularization in machine learning?
Ans:
Regularization adds a penalty to a model’s complexity to prevent it from fitting noise in the training data. It encourages simpler, more generalizable models and reduces overfitting.
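A minimal sketch, assuming scikit-learn and NumPy: Ridge regression adds an L2 penalty on coefficient size, so its coefficients come out smaller than those of plain least squares on the same synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)       # no penalty on complexity
ridge = Ridge(alpha=10.0).fit(X, y)      # L2 penalty shrinks the coefficients

print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```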
13. What is a decision tree and how does it work?
Ans:
A decision tree separates data into branches according to conditions, like a flowchart. It helps in making decisions by following these branches to reach a final prediction.
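A tiny sketch with scikit-learn; the single-number "feature" is invented so the tree only needs to learn one threshold split, which is exactly one branch of the flowchart:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [8], [9]]
y = [0, 0, 1, 1]

# The tree learns a single condition, roughly "is the value below 5?"
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

print(tree.predict([[1.5], [8.5]]))  # follows the learned branches
```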
14. How is bagging different from boosting?
Ans:
Bagging builds multiple models independently and combines their results to improve accuracy. Boosting builds models one after another, focusing on the errors of the previous ones to improve performance.
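A hedged sketch with scikit-learn: a random forest is a bagging-style ensemble of independently trained trees, while gradient boosting fits trees sequentially, each one focusing on the errors of the ensemble so far.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Bagging-style: 50 trees trained independently, predictions averaged.
bagged = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: 50 trees trained sequentially on the previous errors.
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bagged.score(X, y), boosted.score(X, y))  # training accuracy of each
```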
15. What is dimensionality reduction and why is it useful?
Ans:
Dimensionality reduction is the process of reducing the number of input features in a dataset. It makes models faster and reduces overfitting while keeping the most important information.
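A minimal sketch using PCA from scikit-learn on the built-in iris dataset: the four original features are compressed into two principal components that retain most of the variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)           # 150 rows, 4 features
X2 = PCA(n_components=2).fit_transform(X)   # 150 rows, 2 features

print(X2.shape)
```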