1. What does a data scientist do in a company?
Ans:
A data scientist helps the company make better decisions using data. They collect data, find patterns, build models, and share useful insights with teams.
2. How is structured data different from unstructured data?
Ans:
Structured data, such as data in relational databases or Excel sheets, is arranged in rows and columns. Unstructured data includes things like emails, videos, images, or free text, which aren’t stored in a fixed format.
3. What are the main steps in a data science project?
Ans:
A data science project usually follows these steps:
- Understand the problem
- Collect data
- Clean the data
- Explore and analyze it
- Build a model
- Test it
- Share the results
4. How do you deal with missing values in a dataset?
Ans:
You can remove rows with missing values, fill the gaps using the column's average or most common value, or use algorithms that can handle missing data automatically.
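As a rough illustration of the first two options, here is a minimal sketch using only the Python standard library. The records, the field names (`age`, `city`), and the use of `None` to mark a missing value are all assumptions made up for this example.

```python
from statistics import mean, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Delhi"},
]

# Option 1: drop every row that has at least one missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: fill numeric gaps with the average and
# categorical gaps with the most common value.
age_mean = mean(r["age"] for r in rows if r["age"] is not None)
city_mode = mode(r["city"] for r in rows if r["city"] is not None)
filled_rows = [
    {
        "age": r["age"] if r["age"] is not None else age_mean,
        "city": r["city"] if r["city"] is not None else city_mode,
    }
    for r in rows
]
```

Dropping rows is simplest but throws data away; filling keeps every row at the cost of adding values that were never observed.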
5. How does supervised learning differ from unsupervised learning?
Ans:
In supervised learning, the data has labels (like price or category), and the model learns to predict them. In unsupervised learning, the data has no labels, and the goal is to find hidden patterns or groups.
6. What is cross-validation and why is it used?
Ans:
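Cross-validation is a method to check whether your model works well on data it has not seen. It divides the data into several parts (folds), trains the model on all but one fold, tests it on the held-out fold, and repeats until every fold has been used for testing once. Averaging the scores gives a fairer result than a single train/test split.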
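The splitting step can be sketched as below. This is a simplified k-fold split written for illustration; the function name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, which you would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each sample lands in the test set exactly once across the 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

In a real project you would train and score the model inside the loop and then average the scores over the folds.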
7. What does overfitting mean, and how can it be prevented?
Ans:
A model is overfit when it learns the training data too closely, including its noise, and then performs badly on new data. To prevent it, you can simplify the model, use more training data, or apply techniques like regularization.
8. What is a confusion matrix and what does it show?
Ans:
A confusion matrix is a table that shows how well your classification model performed. It includes:
- True Positives (correct positives)
- False Positives (wrongly predicted as positive)
- True Negatives (correct negatives)
- False Negatives (wrongly predicted as negative)
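The four counts above can be computed by hand, as in this minimal sketch for a binary classifier. The label lists and the function name `confusion_counts` are made up for the example, and `1` is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification result."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, really negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, really positive
            else:
                tn += 1  # predicted negative, really negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts)    # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
print(accuracy)  # 0.75
```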
9. How do you pick the most important features from data?
Ans:
You can check how strongly each feature correlates with the target, use feature importance scores from models (like Random Forest), or remove features one by one and see how much the model's performance drops.
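The correlation approach can be sketched as follows. The feature names and numbers are invented for illustration, and Pearson correlation is just one possible choice; it only captures linear relationships.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical housing data: one useful feature, one irrelevant one.
features = {
    "size_sqft": [500, 750, 1000, 1250, 1500],
    "door_color_code": [3, 1, 2, 1, 3],
}
price = [50, 74, 101, 125, 150]

# Rank features by the absolute strength of their correlation with the target.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], price)),
    reverse=True,
)
print(ranked)  # ['size_sqft', 'door_color_code']
```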
10. How does the K-Nearest Neighbors (KNN) algorithm work?
Ans:
KNN looks at the 'K' closest data points to the one you're trying to predict. For classification, it gives the new point the label that most of those neighbors have; for regression, it uses the average of their values.
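A minimal sketch of KNN classification is shown below. The points, labels, and the function name `knn_predict` are made up for the example, and Euclidean distance is an assumption; other distance measures are also used.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    # Sort the training points by Euclidean distance to the query, keep k.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0)))  # A
print(knn_predict(points, labels, (6.0, 6.5)))  # B
```

Note that KNN has no real training step: all the work happens at prediction time, which can be slow on large datasets.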
11. How does a decision tree algorithm work?
Ans:
A decision tree divides the data by asking a series of yes/no questions about the features. At each step, it chooses the question that best separates the data into groups.
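To show what "best separates" can mean, here is a sketch of choosing a single split on one numeric feature. It assumes Gini impurity as the quality measure (entropy is another common choice), and the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Find the question 'value <= t?' with the lowest weighted impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put data on both sides
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical data: did a customer of a given age buy the product?
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)  # 30 0.0
```

A full tree repeats this search on each resulting group, over all features, until the groups are pure or a stopping rule is reached.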
12. What is Random Forest and how is it better than a single decision tree?
Ans:
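Random Forest builds many decision trees, each on a random sample of the data and features, and combines their outputs by voting or averaging. It’s usually more accurate and stable than a single tree because averaging many trees reduces errors and the risk of overfitting.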
13. What is Support Vector Machine (SVM) and how is it used?
Ans:
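SVM is a model that finds the line (or boundary) that separates data into classes with the widest possible margin. With kernel functions it can also handle data that a straight line cannot separate, so it works for both simple and complex problems like face detection or email spam filtering.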
14. What’s the difference between bagging and boosting?
Ans:
Bagging builds multiple models independently, each on a random sample of the data, and combines their results to improve accuracy. Boosting builds models one after another, each learning from the mistakes of the last, to make the final model stronger.
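The bagging half of this answer can be sketched as below; boosting is not shown. Everything here is illustrative: the weak learner `train_stump` is a hypothetical one-threshold model that assumes class 1 has the larger feature values, and the data and seed are made up.

```python
import random
from collections import Counter
from statistics import mean


def train_stump(sample):
    """Hypothetical weak learner: a single threshold on one feature."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample: always predict the only class seen.
        only = sample[0][1]
        return lambda x: only
    # Assumes class 1 has larger values; split midway between class means.
    threshold = (mean(zeros) + mean(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging(data, n_models, rng):
    """Train each model independently on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(train_stump(sample))
    return models


def bagged_predict(models, x):
    """Combine the models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
models = bagging(data, n_models=25, rng=random.Random(0))
print(bagged_predict(models, 1.0), bagged_predict(models, 10.0))
```

Because the models never see each other's results, they could be trained in parallel, which is not possible with boosting's one-after-another training.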
15. How does the Naive Bayes algorithm work?
Ans:
Naive Bayes predicts outcomes using probability. It assumes features are independent of each other given the class and uses past data to calculate the chance of something happening (like an email being spam).
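For the spam example, a word-count version of Naive Bayes can be sketched as follows. The messages, the labels "spam" and "ham", and the function names are invented for illustration, and the add-one (Laplace) smoothing is an assumption added so that unseen words do not produce a zero probability.

```python
from collections import Counter
from math import log


def train(docs):
    """docs is a list of (words, label) pairs; returns the counts needed."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return class_counts, word_counts, vocab


def predict(model, words):
    """Pick the class with the highest log-probability for these words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the prior: how common the class is in past data.
        score = log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Independence assumption: each word contributes separately.
            # The +1 is add-one (Laplace) smoothing.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training messages.
docs = [
    ("win free prize now".split(), "spam"),
    ("free cash win".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train(docs)

print(predict(model, "free prize".split()))        # spam
print(predict(model, "meeting tomorrow".split()))  # ham
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow on long messages.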