1. What is the role of a data scientist in a company?
Ans:
A data scientist helps organizations make data-driven decisions by collecting data, finding patterns, building predictive models and delivering insights that guide various business teams.
2. How do structured and unstructured data differ?
Ans:
Structured data is neatly organized in tables with rows and columns, like databases or spreadsheets. Unstructured data includes formats such as emails, videos, images and text that don’t have a predefined format.
3. What are the main phases of a data science project?
Ans:
A typical project includes understanding the problem, collecting and cleaning data, exploring and analyzing it, developing a model, validating it and finally presenting the results to stakeholders.
4. How should missing values in a dataset be handled?
Ans:
Missing values can be handled by removing incomplete rows, filling gaps with an average (mean or median) or the most common value, or using algorithms that handle missing values automatically.
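The first two options can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the records, the column names and the helpers `drop_incomplete` and `impute` are all hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Delhi"},
]

def drop_incomplete(records):
    """Keep only the rows that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute(records, column, strategy):
    """Return copies of the rows with gaps in `column` filled.

    strategy "mean" uses the average of the observed values;
    any other strategy uses the most common observed value.
    """
    observed = [r[column] for r in records if r[column] is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]

# Option 1: remove rows that contain a gap.
complete_rows = drop_incomplete(rows)

# Option 2: fill numeric gaps with the mean and categorical gaps with the mode.
filled_rows = impute(impute(rows, "age", "mean"), "city", "mode")
```

Dropping rows is simplest but discards data, while imputation keeps every row at the cost of introducing estimated values.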
5. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning uses labeled data with known outputs like categories or values to train models. Unsupervised learning works on unlabeled data to discover hidden patterns or groups within the data.
6. What is cross-validation and why is it used?
Ans:
Cross-validation estimates a model’s performance by splitting the data into several parts (folds) and repeatedly training on some folds while testing on the held-out one. Averaging the results gives a more reliable estimate of how the model will perform on new, unseen data than a single train/test split.
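As a rough sketch of the idea, k-fold cross-validation can be written in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_val_scores`), the toy data and the "predict the training mean" model are assumptions made for illustration. Real projects usually shuffle the data before splitting and rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_val_scores(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, and return the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return scores

# Toy model (an assumption for illustration): always predict the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def mean_abs_error(model, X_test, y_test):
    return sum(abs(model - target) for target in y_test) / len(y_test)

X = list(range(10))
y = [2.0 * x for x in X]
scores = cross_val_scores(fit_mean, mean_abs_error, X, y, k=5)
average_error = sum(scores) / len(scores)
```

Every sample lands in the test fold exactly once, so the averaged score reflects performance on data the model did not see during training.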
7. What is overfitting and how can it be prevented?
Ans:
Overfitting happens when a model learns the training data too closely, including its noise, so it becomes less accurate on new data. It can be prevented by simplifying the model, adding more training data or applying regularization techniques.
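To make the regularization option concrete, here is a minimal sketch of L2 (ridge) regularization for a one-feature linear model with no intercept. The function name `ridge_slope`, the penalty strength `alpha` and the toy data are assumptions for illustration. Minimizing sum((y - w*x)^2) + alpha*w^2 gives the closed form w = sum(x*y) / (sum(x^2) + alpha).

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form slope for y ~ w * x (no intercept) with an L2 penalty alpha * w**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, alpha=0.0)   # ordinary least squares slope
regularized = ridge_slope(xs, ys, alpha=14.0)    # same fit with a strong penalty
```

A larger `alpha` pulls the slope toward zero, which limits how strongly the model can react to noise in the training data.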
8. What does a confusion matrix show?
Ans:
A confusion matrix is a table that summarizes a classification model's performance by showing the counts of true positives, false positives, true negatives and false negatives. These counts are the basis for metrics such as accuracy, precision and recall.
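A small sketch of how the four counts are tallied for a binary classifier, and how common metrics follow from them. The labels, predictions and helper names (`confusion_counts`, `summary_metrics`) are made up for illustration, and class `1` is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

def summary_metrics(c):
    """Derive accuracy, precision and recall from the four counts."""
    total = sum(c.values())
    predicted_pos = c["TP"] + c["FP"]
    actual_pos = c["TP"] + c["FN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total if total else 0.0,
        "precision": c["TP"] / predicted_pos if predicted_pos else 0.0,
        "recall": c["TP"] / actual_pos if actual_pos else 0.0,
    }

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)   # {"TP": 3, "FP": 1, "TN": 3, "FN": 1}
metrics = summary_metrics(counts)           # accuracy, precision and recall are all 0.75
```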
9. How can you determine the most important features in a dataset?
Ans:
Key features can be identified using correlation analysis, feature importance scores from models such as Random Forest, or by measuring how model performance changes when a feature is removed.
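The correlation-based approach can be sketched as follows. The feature names, values and the helpers `pearson` and `rank_features` are hypothetical. Note that Pearson correlation only captures linear relationships, so it is a first screen rather than a full importance measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def rank_features(features, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    strength = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(strength.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy dataset: exam score against two candidate features.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 9, 6, 8, 7],
}
exam_score = [52, 58, 61, 70, 74]

ranking = rank_features(features, exam_score)
```

In this toy data `hours_studied` ranks first because it rises almost in step with the score, while `shoe_size` shows almost no linear relationship.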
10. How does the K-Nearest Neighbors (KNN) algorithm work?
Ans:
KNN predicts the label or value of a new data point by finding the K closest points in the training set according to a distance measure, then taking a majority vote of their labels (classification) or the average of their values (regression).
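The procedure can be sketched in plain Python as below. This is a brute-force illustration that assumes Euclidean distance and made-up 2D points. The function names `knn_classify` and `knn_regress` are hypothetical, and ties in the vote are resolved by whichever label was encountered first.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def k_nearest(train_points, train_targets, query, k):
    """Return the targets of the k training points closest to `query`."""
    by_distance = sorted(zip(train_points, train_targets),
                         key=lambda pair: dist(pair[0], query))
    return [target for _, target in by_distance[:k]]

def knn_classify(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(train_points, train_labels, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=3):
    """Predict a value as the average over the k nearest neighbors."""
    nearest = k_nearest(train_points, train_values, query, k)
    return sum(nearest) / len(nearest)

# Hypothetical training data: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10, 12, 11, 50, 52, 51]

label_near_first_cluster = knn_classify(points, labels, (2, 2), k=3)    # "A"
label_near_second_cluster = knn_classify(points, labels, (6, 5), k=3)   # "B"
value_at_origin_cluster = knn_regress(points, values, (1, 1), k=3)      # 11.0
```

Sorting every training point for each query is fine for small datasets. Larger ones typically use spatial indexes such as KD-trees to speed up the neighbor search.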