1. What is Data Science and how is it different from regular data analysis?
Ans:
Data scientists integrate statistics, programming, and domain expertise to gain valuable insights from data. Unlike regular data analysis, which mainly looks at historical data to create reports, Data Science goes a step further by using techniques like machine learning to make predictions about future outcomes.
2. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning uses labeled data, where the correct answers are already known, to train a model to make predictions. In contrast, unsupervised learning uses unlabeled data and looks for hidden patterns, groupings, or structures without any predefined correct answers.
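As a minimal sketch (assuming scikit-learn is available), the same features can be used both ways: a classifier trains on labels, while a clustering algorithm finds groups without them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: features X come with known labels y.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
pred = clf.predict([[2.5], [11.5]])  # predicts labels for new points

# Unsupervised: same features, no labels; KMeans finds groupings on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignments, not predefined answers
```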
3. What is overfitting in machine learning and how can it be prevented?
Ans:
Overfitting occurs when a model learns the noise in the training data along with the useful patterns. It performs well on known data but poorly on new data. To prevent overfitting, we can simplify the model, gather more training data, or use techniques like regularization and cross-validation.
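A short illustrative sketch with scikit-learn (the data here is synthetic, chosen just for illustration): regularization penalizes large coefficients, and cross-validation measures performance on held-out data rather than the data the model was trained on:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: y depends mainly on the first feature, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

# Regularization: alpha shrinks coefficients, discouraging the model
# from fitting noise in the training data.
model = Ridge(alpha=1.0)

# Cross-validation: 5-fold scores estimate performance on unseen data.
scores = cross_val_score(model, X, y, cv=5)
```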
4. Can you explain bias and variance in simple terms?
Ans:
Bias refers to errors from overly simple models that fail to capture complex patterns; this causes underfitting. Variance refers to models that are too complex and react too strongly to small changes in the training data; this causes overfitting. A good model balances bias and variance so it performs well on new data.
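One common way to see this trade-off (a sketch on synthetic data, assuming scikit-learn) is to fit polynomials of different degrees: a straight line is too simple (high bias), while a very high-degree polynomial chases the noise (high variance):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Noisy samples from a smooth curve.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 3, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

# Noise-free test points to measure error on unseen data.
X_test = np.linspace(0, 3, 100).reshape(-1, 1)
y_test = np.sin(X_test).ravel()

errors = {}
for degree in (1, 3, 15):  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    errors[degree] = mean_squared_error(y_test, model.predict(X_test))
```

The degree-1 fit underfits (high bias) and the degree-15 fit tends to overfit (high variance), so the moderate degree-3 model usually gives the lowest test error.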
5. How do R and Python differ from one another in data science?
Ans:
Python is widely used for building machine learning models and applications due to its simplicity and versatility. R is often preferred for deep statistical analysis and creating detailed visualizations. While Python is common in industry and tech, R is popular in academic and research fields.
6. How do you handle missing values in a dataset?
Ans:
To deal with missing data, we can either remove the incomplete rows or fill in the gaps using methods like replacing with the mean, median, or mode. In some cases, predictive techniques such as KNN imputation or regression can estimate the missing values.
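A minimal sketch of the first two options with pandas (the toy DataFrame is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 40],
    "city": ["NY", None, "LA", "NY"],
})

# Option 1: drop any row containing a missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median, categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```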
7. What is feature engineering and why is it important?
Ans:
Feature engineering is the process of creating, transforming, or selecting features to improve a model's performance. This might involve combining columns, extracting new variables, or converting data into formats that are easier for algorithms to understand.
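For example, with pandas (the columns below are hypothetical, chosen to show the two common moves of extracting and combining):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-15", "2023-06-01"]),
    "total_spent": [120.0, 45.0],
    "num_orders": [4, 3],
})

# Extract a new variable from an existing column.
df["signup_month"] = df["signup_date"].dt.month

# Combine columns into a feature that may be more predictive than either alone.
df["avg_order_value"] = df["total_spent"] / df["num_orders"]
```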
8. What’s the difference between classification and regression problems?
Ans:
Classification is used to predict categories or labels, such as “yes” or “no,” “spam” or “not spam.” Regression is used to predict continuous values like temperature, price or age.
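The distinction shows up directly in the target values, as in this scikit-learn sketch (toy data, decision trees chosen only because they come in both flavors):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: targets are discrete labels.
y_class = np.array(["no", "no", "yes", "yes"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)

# Regression: targets are continuous values.
y_reg = np.array([10.5, 20.1, 29.8, 40.2])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)
```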
9. What is a confusion matrix used for?
Ans:
A confusion matrix is a table that shows how well a classification model is performing. It displays the number of correct and incorrect predictions, helping to measure accuracy, precision, recall, and other performance metrics.
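A quick sketch with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For binary labels, the cells are true negatives, false positives,
# false negatives, and true positives.
tn, fp, fn, tp = cm.ravel()
```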
10. What do precision and recall mean?
Ans:
Precision measures how many of the model's positive predictions were actually correct. Recall shows how many of the actual positive cases were identified correctly by the model. Both are key to evaluating a classification model's effectiveness.
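Both metrics are one line each in scikit-learn (same made-up labels as a worked example):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision: of the predicted positives, how many were truly positive?
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)

# Recall: of the actual positives, how many did the model find?
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
```

Here the model made 4 positive predictions of which 3 were correct (precision 0.75), and found 3 of the 4 actual positives (recall 0.75).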