1. What is Data Science vs traditional analysis?
Ans:
Data Science involves collecting, cleaning, analyzing, and applying data to make predictions or decisions. It encompasses fields like machine learning, big data, and visualization. Unlike traditional data analysis, which focuses on identifying patterns in past data, Data Science goes further by building models to forecast future trends.
2. Supervised vs unsupervised learning?
Ans:
Supervised learning uses labeled data where the outcomes are known, allowing the model to learn to predict those results. Unsupervised learning deals with unlabeled data, where the model tries to discover hidden structures or groupings on its own.
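A minimal sketch of the contrast, assuming scikit-learn and its bundled iris dataset (the model choices here are illustrative): the classifier is trained with the labels, while KMeans groups the same points without ever seeing them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are known, so the model learns to predict them.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised: only X is given; KMeans discovers groupings on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:   ", km.labels_[:5])
```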
3. What is overfitting in models?
Ans:
Overfitting occurs when a model learns the training data too closely, including its noise, so it performs poorly on new, unseen data. It can be mitigated by using simpler models, applying cross-validation, or employing regularization.
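A hedged illustration with scikit-learn and synthetic data: an unconstrained decision tree scores almost perfectly on its training set but noticeably worse on held-out data, while a depth-limited (simpler) tree narrows that gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data, including its noise.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting depth is one simple way to keep the model from overfitting.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```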
4. Explain bias-variance tradeoff.
Ans:
Bias is the error caused by overly simplistic assumptions in the model, which leads to underfitting, while variance is the error caused by sensitivity to fluctuations in the training data, which leads to overfitting. An effective model balances bias and variance to achieve good accuracy on both training and unseen data.
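One way to see the tradeoff, sketched with numpy and scikit-learn on made-up data: a degree-1 polynomial underfits a noisy sine curve (high bias), a very high-degree polynomial overfits it (high variance), and an intermediate degree typically does best on held-out folds.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy sine curve

# Low degree -> high bias; very high degree -> high variance.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree {degree:2d}  mean CV R^2: {score:.3f}")
```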
5. Python vs R in Data Science?
Ans:
Python is widely used for building machine learning models and handling large datasets, making it versatile. R excels in statistical analysis and rapid data visualization. Python serves as a general-purpose language, while R is more specialized for statistics.
6. How to handle missing data?
Ans:
Missing data can be addressed by removing affected rows, imputing missing values with mean or median, or using algorithms that can work with incomplete data. The choice depends on the quantity and nature of the missing information.
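A brief pandas/scikit-learn sketch of the first two options on a small hypothetical frame (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 58000]})

# Option 1: drop rows that contain any missing value.
dropped = df.dropna()

# Option 2: impute missing values with the column median.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```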
7. Define feature engineering.
Ans:
Feature engineering involves creating or modifying input variables to improve a model’s predictive power. It helps the model better capture important patterns in the data.
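A small pandas sketch with invented columns: deriving a ratio and a date part gives the model signals it could not easily recover from the raw fields.

```python
import pandas as pd

df = pd.DataFrame({
    "total_price": [120.0, 80.0, 200.0],
    "quantity": [4, 2, 5],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-20"]),
})

# Derived features often expose patterns the raw columns hide.
df["unit_price"] = df["total_price"] / df["quantity"]
df["order_month"] = df["order_date"].dt.month
print(df)
```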
8. Classification vs regression tasks?
Ans:
Classification predicts discrete categories, such as “spam” or “not spam,” while regression predicts continuous numerical outcomes like prices or temperatures.
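A quick sketch of the difference, assuming scikit-learn and synthetic data: the classifier outputs discrete labels, the regressor outputs continuous numbers.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

Xc, yc = make_classification(n_samples=200, random_state=0)
Xr, yr = make_regression(n_samples=200, noise=10, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xc, yc)  # discrete categories
reg = LinearRegression().fit(Xr, yr)                 # continuous values

print("classification output (labels): ", clf.predict(Xc[:3]))
print("regression output (numbers):    ", reg.predict(Xr[:3]))
```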
9. What is a confusion matrix?
Ans:
A confusion matrix is a table that compares predicted classifications with actual results, detailing true positives, false positives, true negatives, and false negatives to evaluate model performance.
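A tiny example with hypothetical labels, using scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# For binary labels 0/1, the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```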
10. Define precision and recall.
Ans:
Precision measures how many of the model's positive predictions were actually correct, while recall measures how many of the actual positives the model successfully identified.
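Continuing the same hypothetical labels, a sketch with scikit-learn's metric functions:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP); Recall = TP / (TP + FN).
print("precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```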
11. Importance of cross-validation?
Ans:
Cross-validation assesses a model’s ability to generalize by testing it on different subsets of data. It helps prevent overfitting and provides a more reliable estimate of performance.
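A minimal sketch, assuming scikit-learn and its iris dataset: 5-fold cross-validation trains on four folds and tests on the fifth, rotating through all five splits.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold takes a turn as the held-out test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```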
12. Purpose of regularization?
Ans:
Regularization adds a penalty for complexity to the model, encouraging simpler solutions that generalize better and reducing the risk of overfitting.
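An illustrative comparison with scikit-learn on synthetic data (the alpha values are arbitrary): Ridge (L2) shrinks coefficients toward zero, while Lasso (L1) can set some exactly to zero, both yielding simpler models than plain least squares.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=30, noise=5, random_state=0)

# The penalized models end up with smaller coefficients overall.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, "sum of |coefficients|:", abs(model.coef_).sum())
```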
13. What is a decision tree?
Ans:
A decision tree splits data into branches based on conditions, like a flowchart, leading to decisions or predictions at the leaves. It simplifies complex decision-making processes.
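A short sketch, assuming scikit-learn and the iris dataset: fitting a shallow tree and printing its learned flowchart of if/else splits.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned rules: each branch is a condition, each leaf a prediction.
print(export_text(tree, feature_names=iris.feature_names))
```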
14. Bagging vs boosting methods?
Ans:
Bagging builds multiple models independently and combines their outputs to improve accuracy, while boosting builds models sequentially, each focusing on correcting the errors of the previous ones to enhance performance.
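A hedged side-by-side with scikit-learn on synthetic data (estimator counts are illustrative): BaggingClassifier trains independent trees on bootstrap samples, while GradientBoostingClassifier adds trees sequentially to correct earlier errors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: many independent models on bootstrap samples, outputs combined.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: models built one after another, each focusing on previous mistakes.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging ", bagging), ("boosting", boosting)]:
    print(f"{name} mean CV accuracy: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```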
15. Define dimensionality reduction.
Ans:
Dimensionality reduction involves reducing the number of features in a dataset while retaining the most important information. It speeds up training and reduces the risk of overfitting.
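A minimal sketch using PCA from scikit-learn on the digits dataset: keeping enough principal components to retain about 95% of the variance cuts the feature count substantially.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

# Keep enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("original features:", X.shape[1], "-> reduced:", X_reduced.shape[1])
```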