1. What is Data Science, and how does it differ from Data Analytics?
Ans:
Data Science is the process of collecting, analyzing, and interpreting large volumes of data using various tools, algorithms, and techniques. While Data Analytics focuses on interpreting existing data to identify patterns and solve problems, Data Science is broader and includes analytics, machine learning, and predictive modeling.
2. What does a Data Scientist do in a company?
Ans:
A Data Scientist analyzes complex data, builds models, and uncovers patterns to help businesses make data-driven decisions. They play a key role in solving real-world business challenges using data.
3. What is the difference between structured and unstructured data?
Ans:
Structured data is organized in rows and columns, like spreadsheets or relational databases. Unstructured data has no predefined format and includes files like emails, images, videos, and social media posts.
4. What are the key steps in a Data Science project?
Ans:
- Understanding the problem
- Collecting data
- Cleaning data
- Analyzing it
- Building models
- Interpreting the results
5. How do you deal with missing values in data?
Ans:
You can remove rows with missing values, fill them with the mean, median, or mode of the column (imputation), or predict them from the other features using a model.
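A minimal sketch of mean imputation in plain Python (the function name is illustrative, and missing entries are assumed to be represented as None):

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

For example, `impute_mean([1, None, 3])` fills the gap with the mean of the observed values, 2.0.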
6. What is the difference between Supervised and Unsupervised Learning?
Ans:
Supervised Learning uses labeled data (with known outcomes) to train models, while Unsupervised Learning identifies patterns in data without labeled outputs.
7. What is Cross-Validation in machine learning?
Ans:
Cross-validation is a technique to evaluate model performance by dividing the dataset into parts, training on some parts, and testing on the remaining parts to ensure the model generalizes well.
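A simple sketch of how k-fold cross-validation splits a dataset into train/test index pairs (function name illustrative; real workflows would also shuffle the data first):

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) index pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        # The i-th slice is held out for testing; the rest is for training.
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        folds.append((train, test))
    return folds
```

Each sample appears in exactly one test fold, so every part of the data is used for evaluation once.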
8. What is a Confusion Matrix?
Ans:
A confusion matrix evaluates classification models by displaying true positives, true negatives, false positives, and false negatives. It helps assess accuracy and error types.
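The four cells of a binary confusion matrix can be computed directly by counting (a minimal sketch, assuming labels are encoded as 0/1):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn
```

Metrics like accuracy ((TP + TN) / total), precision, and recall are all derived from these four counts.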
9. How do you select important features for a model?
Ans:
Feature selection techniques include correlation analysis, model-based importance scores (like in Random Forest), and recursive feature elimination to identify which features impact performance most.
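Correlation analysis, the first technique above, can be sketched with a plain-Python Pearson correlation (features with near-zero correlation to the target are candidates for removal):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A value near +1 or -1 indicates a strong linear relationship with the target; a value near 0 suggests the feature adds little linear signal.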
10. How does the K-Nearest Neighbors (KNN) algorithm work?
Ans:
KNN identifies the 'k' closest data points to a new instance and predicts its value based on the majority class (for classification) or average value (for regression) of its neighbors.
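A minimal KNN classifier in plain Python (names are illustrative; squared Euclidean distance is used since only the ordering of distances matters):

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For regression, the majority vote would be replaced by the average of the neighbours' values.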
11. How do decision trees work?
Ans:
Decision Trees split the dataset into branches based on feature-based questions. Each branch represents a decision rule, and the process continues until a final prediction is made at the leaf node.
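The branching logic can be sketched with a small hand-built tree, where each internal node asks one feature-threshold question and each leaf is a prediction (features, thresholds, and labels below are made-up examples):

```python
# Hypothetical tree: feature 0 might be age, feature 1 might be income.
tree = {
    "feature": 0, "threshold": 30,
    "left": "low-risk",                             # leaf node
    "right": {"feature": 1, "threshold": 50000,     # nested decision rule
              "left": "high-risk", "right": "low-risk"},
}

def tree_predict(node, x):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node
```

Training algorithms such as CART choose the feature and threshold at each node by maximizing a purity measure like Gini gain; the sketch above only shows how a fitted tree makes a prediction.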
12. What is SVM (Support Vector Machine), and where is it used?
Ans:
SVM is a classification algorithm that finds the optimal boundary (hyperplane) between different classes. It’s widely used in applications like image classification and spam detection.
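Once an SVM is trained, classification reduces to checking which side of the hyperplane a point falls on, i.e. the sign of w·x + b (a sketch of the decision function only; the weights below are assumed already learned):

```python
def svm_decision(w, b, x):
    """Sign of w.x + b: which side of the learned hyperplane x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

Finding the w and b that maximize the margin between the classes is the optimization problem that SVM training solves.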
13. How does Naive Bayes work?
Ans:
Naive Bayes is a probabilistic classifier that predicts outcomes based on prior probabilities and assumes feature independence. It’s fast and works well with text classification.
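The independence assumption means a class score is just the log prior plus the sum of per-feature log likelihoods. A toy sketch for spam detection (the word probabilities below are made-up numbers, not learned values):

```python
from math import log

def naive_bayes_score(doc_words, class_prior, word_probs):
    """Log-score of a class: log prior plus summed log likelihoods.
    Summing logs = multiplying probabilities, valid only because the
    features are assumed independent given the class."""
    return log(class_prior) + sum(log(word_probs.get(w, 1e-6)) for w in doc_words)

# Hypothetical per-class word probabilities:
spam_probs = {"free": 0.8, "win": 0.7}
ham_probs = {"free": 0.1, "win": 0.05}
```

The predicted class is simply the one with the higher score; the `1e-6` fallback stands in for proper smoothing of unseen words.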
14. What is k-means clustering used for?
Ans:
K-Means is an unsupervised algorithm that groups data points into clusters based on similarity. It's commonly used in customer segmentation, behavior analysis, and pattern detection.
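The core loop alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. A minimal 1-D sketch (real implementations work in many dimensions and check for convergence):

```python
def kmeans(points, centroids, iterations=10):
    """Tiny 1-D k-means: assign points, then move centroids to cluster means."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

Starting from rough initial centroids, the two well-separated groups below settle at their true means within a few iterations.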
15. Describe a neural network.
Ans:
A Neural Network is inspired by the human brain and consists of layers of interconnected nodes. It processes data in layers and is used in complex tasks like image recognition, speech processing, and deep learning.
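The "layers of interconnected nodes" can be sketched as a single dense layer: each node computes a weighted sum of its inputs plus a bias, passed through an activation such as the sigmoid (a forward-pass sketch only; training via backpropagation is not shown):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def layer_forward(inputs, weights, biases):
    """One dense layer: each node outputs sigmoid(w.x + b)."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]
```

Stacking calls (feeding one layer's output into the next) is what makes the network "deep": `output = layer_forward(layer_forward(x, W1, b1), W2, b2)`.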
16. What are Ensemble Methods?
Ans:
Ensemble methods combine multiple machine learning models to improve accuracy. Popular examples include Random Forest (bagging, which mainly reduces variance) and Gradient Boosting (boosting, which mainly reduces bias).
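The simplest way several classifiers are combined is majority voting, as in a Random Forest's final prediction (a minimal sketch; the model predictions below are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the predictions of several models: most common label wins."""
    return Counter(predictions).most_common(1)[0][0]
```

With three models predicting `['spam', 'spam', 'ham']`, the ensemble outputs `'spam'` even though one model disagreed.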
17. How do you handle outliers in a dataset?
Ans:
Outliers can be managed by:
- Removing them if they’re errors
- Applying transformations
- Analyzing them separately if they provide meaningful insights
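A common rule for flagging outliers is the IQR fence: values beyond 1.5 x IQR from the quartiles. A rough sketch (quartiles here are approximated by simple index positions rather than interpolation):

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (crude quartiles)."""
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]
```

Whether a flagged value is then removed, transformed, or analyzed separately depends on whether it is an error or a genuine extreme observation.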
18. How can you scale features in a dataset?
Ans:
Feature scaling techniques include:
- Normalization (Min-Max Scaling): Scales values between 0 and 1
- Standardization (Z-score): Centers data around mean with unit variance
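Both techniques can be sketched in a few lines of plain Python (function names are illustrative; in practice the scaling parameters are fitted on training data and reused on test data):

```python
def min_max_scale(values):
    """Normalization: map values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
```

After standardization the values have mean 0 and unit variance; after min-max scaling the smallest value becomes 0 and the largest becomes 1.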
19. What is One-Hot Encoding?
Ans:
One-Hot Encoding converts categorical variables into a binary format (0s and 1s), enabling machine learning models to process non-numeric data efficiently.
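A minimal sketch of one-hot encoding (the function name is illustrative; categories are sorted so the column order is deterministic):

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

With two categories `['green', 'red']`, each `'red'` becomes `[0, 1]` and each `'green'` becomes `[1, 0]`, so a model sees only numbers.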
20. Why is Feature Selection important?
Ans:
Feature selection improves model performance by removing irrelevant or redundant data. It enhances accuracy, speeds up training time, and prevents overfitting.