1. What does a data scientist do in a company?
Ans:
A data scientist helps a company make better decisions using data. They collect and clean data, analyze it to find patterns or trends, and build models to predict future outcomes. Their main job is to turn raw data into useful insights that help solve business problems or improve performance.
2. What is the difference between structured and unstructured data?
Ans:
Structured data is organized and easy to store in tables, like numbers, dates, or categories in spreadsheets. Unstructured data doesn’t follow a fixed format; it includes things like emails, videos, images, and text documents. Structured data is relatively simple to analyze, but unstructured data requires specialized tools and techniques.
3. What are the main steps in a data science project?
Ans:
A typical data science project starts with understanding the problem. Then, data is collected, cleaned, and explored. Next, models are built using the data, and their performance is tested. Finally, the best model is deployed to solve the actual problem or help in decision-making.
4. How do you deal with missing values in data?
Ans:
There are two main options for dealing with missing data: either remove the rows or columns that contain missing values, or fill them in using the mean, median, or most frequent value. The choice depends on how much data is missing and how important it is for the analysis.
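Both options can be sketched in a few lines of plain Python. This is only an illustration: the function names `fill_missing` and `drop_missing`, the use of `None` to mark a missing value, and the sample `ages` list are assumptions made for the example, not part of any particular library.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    strategies = {"mean": mean, "median": median, "most_frequent": mode}
    fill = strategies[strategy](observed)
    return [fill if v is None else v for v in values]


def drop_missing(rows):
    """Keep only the rows that contain no None entries."""
    return [row for row in rows if all(v is not None for v in row)]


# Hypothetical sample data: None marks a missing value.
ages = [25, None, 31, 40, None]
filled_ages = fill_missing(ages, "median")

rows = [(25, "yes"), (None, "no"), (40, None), (31, "no")]
complete_rows = drop_missing(rows)
```

Dropping rows is the simpler choice when only a few values are missing; filling them in keeps more data but can distort the distribution if many values are imputed.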
5. How does supervised learning differ from unsupervised learning?
Ans:
In supervised learning, a model is trained on labeled data with known outcomes. It’s like teaching with answers. In unsupervised learning, the model tries to find patterns on its own using data without labels. It’s often used to group or cluster similar data.
6. What does cross-validation mean in model testing?
Ans:
Cross-validation is a method used to check how well a model will perform on new data. The data is split into parts: some parts are used to train the model, and the rest to test it. The process is usually repeated so that each part serves as the test set once, and the results are averaged. This helps make sure the model is not just working well on one specific split of the data but is truly reliable.
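The splitting step can be sketched as a small k-fold index generator. The function name `k_fold_indices` is hypothetical, and this version does not shuffle the data; in practice the rows are usually shuffled first, and a library routine would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds, without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    # A model would be trained on `train` and scored on `test` here;
    # the k scores are then averaged.
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every row is used for testing once and for training k - 1 times.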
7. What is overfitting in machine learning, and how can you prevent it?
Ans:
Overfitting happens when a model learns the training data too well, including its noise or mistakes. It performs well on training data but poorly on new data. To avoid overfitting, we can use simpler models, more training data, cross-validation, or regularization techniques.
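As a rough illustration of regularization, the sketch below fits a one-feature linear model with no intercept and an L2 (ridge) penalty. The function name `ridge_slope`, the penalty strength `alpha`, and the sample data are assumptions chosen for the example. Minimizing sum((y - w*x)^2) + alpha * w^2 gives w = sum(x*y) / (sum(x*x) + alpha), so a larger `alpha` pulls the slope toward zero.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w of y ~ w * x (no intercept) minimizing squared error + alpha * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = ridge_slope(xs, ys, alpha=0.0)   # no penalty: ordinary least squares
w_ridge = ridge_slope(xs, ys, alpha=14.0)  # strong penalty: slope is shrunk
print(w_plain, w_ridge)
```

The penalty trades a little training accuracy for a simpler model; in practice `alpha` is usually chosen with cross-validation.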
8. What is a confusion matrix, and what does it show?
Ans:
A confusion matrix is a table used to show how well a classification model is working. It compares predicted values with actual values and, for a two-class problem, includes four parts: true positives, true negatives, false positives, and false negatives. This helps us measure the accuracy of the model and see which kinds of errors it makes.
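For a two-class problem, the four counts can be computed directly from the actual and predicted labels. This is a minimal sketch; the function name `confusion_counts`, the choice of `1` as the positive class, and the sample labels are assumptions for the example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Hypothetical labels for eight samples.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, accuracy, precision, recall)
```

Accuracy, precision, and recall are all derived from the same four counts, which is why the confusion matrix is usually the first thing to inspect.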
9. How do you choose the most important features in a dataset?
Ans:
To select important features, we can use methods like correlation checks, feature importance from models like random forest, or statistical tests. Removing less important features can make the model faster and, in many cases, more accurate.
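A correlation check can be sketched as follows: compute the Pearson correlation between each feature and the target, then rank features by its absolute value. The names `pearson` and `rank_features` and the small housing-style dataset are made up for illustration. Correlation only captures linear relationships, so this is a first filter rather than a complete method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Return feature names sorted by absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical dataset: price depends on size, not on door color.
features = {
    "door_color_code": [3, 1, 2, 3, 1],
    "size_sqm": [50, 60, 80, 100, 120],
}
price = [150, 180, 240, 300, 360]

ranking = rank_features(features, price)
print(ranking)
```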
10. How does the K-Nearest Neighbors (KNN) algorithm work?
Ans:
KNN is a simple algorithm that classifies a data point based on its neighbors. To predict the class of a new data point, it looks at the ‘k’ closest known data points and assigns the most common class among them. It’s based on the idea that similar things are found close to each other.
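The idea fits in a few lines of Python. This is a minimal sketch that uses Euclidean distance and a simple majority vote; the function name `knn_predict` and the sample points are assumptions for the example, and real implementations add feature scaling, explicit tie handling, and faster neighbor search.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort the labeled points by Euclidean distance to the query and keep the closest k.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (2.0, 2.0), k=3)
near_b = knn_predict(points, labels, (6.0, 7.0), k=3)
print(near_a, near_b)
```

An odd value of `k` is a common choice for two-class problems because it avoids tied votes.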