1. What is data science, and how is it different from data analytics?
Ans:
Data science involves extracting insights and building predictive models using techniques from statistics, machine learning, and computer science. Data analytics focuses more on examining datasets to find trends and solve problems, often using descriptive statistics. Data science is broader and more predictive in nature.
2. What role does a data scientist perform in a company?
Ans:
A data scientist builds models to solve business problems, analyzes large datasets, cleans and prepares data, and communicates findings to stakeholders using data visualizations and reports.
3. Describe how structured and unstructured data are different.
Ans:
Structured data is organized in rows and columns (e.g., SQL databases). Unstructured data includes formats like images, videos, emails, and social media posts, which don’t follow a fixed schema.
4. What are a data science project's key steps?
Ans:
- Data collection
- Data cleaning and preprocessing
- Exploratory data analysis (EDA)
- Model building
- Model evaluation
5. How do you handle missing data in a dataset?
Ans:
- Removing rows/columns with missing values
- Imputing values using mean, median, or mode
- Using advanced methods like KNN imputation or regression models
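A minimal pandas/scikit-learn sketch of these options, using a toy DataFrame (the values are illustrative only):

```python
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy DataFrame with missing values (made-up numbers)
df = pd.DataFrame({"age": [25, None, 40, 35],
                   "salary": [50000, 60000, None, 80000]})

df_dropped = df.dropna()                       # option 1: drop incomplete rows

mean_imputer = SimpleImputer(strategy="mean")  # option 2: mean imputation
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

knn_imputer = KNNImputer(n_neighbors=2)        # option 3: KNN-based imputation
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
```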
6. Describe how cross-validation is used in model evaluation.
Ans:
Cross-validation splits data into training and validation sets multiple times (e.g., k-fold), which gives a more reliable estimate of model performance and helps detect overfitting.
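For example, a 5-fold cross-validation sketch with scikit-learn (the dataset and model are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k=5: the model is trained and validated on 5 different splits
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())  # average score and its spread
```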
7. What is the difference between supervised and unsupervised learning?
Ans:
- Supervised learning: The model is trained on labeled data (e.g., classification, regression).
- Unsupervised learning: No labels; the model finds patterns (e.g., clustering, dimensionality reduction).
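A minimal sketch of the contrast, assuming scikit-learn's built-in iris data (the specific models are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: learns a mapping from features X to known labels y
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: finds structure in X alone (here, 2 principal components)
X_2d = PCA(n_components=2).fit_transform(X)
```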
8. What is a confusion matrix? Explain its components.
Ans:
A confusion matrix evaluates classification models by showing:
- TP (True Positive)
- TN (True Negative)
- FP (False Positive)
- FN (False Negative)
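A small sketch with scikit-learn and made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (toy data)

# For binary labels [0, 1] the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```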
9. How do you select important features in a dataset?
Ans:
- Filter methods (e.g., correlation)
- Wrapper methods (e.g., recursive feature elimination)
- Embedded methods (e.g., Lasso regularization)
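An illustrative sketch of all three approaches in scikit-learn (the dataset, k, and regularization strength are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: keep the 10 features most related to the target (ANOVA F-test)
X_filtered = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper: recursive feature elimination around a base model
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: L1 (Lasso) regularization shrinks unhelpful coefficients to zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
```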
10. Explain the working of the k-nearest neighbors (KNN) algorithm.
Ans:
KNN classifies a data point based on its 'k' closest neighbors in the training set. It uses a distance metric (such as Euclidean distance) to find these neighbors and predicts by majority vote among them (for classification) or by averaging their values (for regression).
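A minimal scikit-learn sketch (k=5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 5 neighbors, Euclidean distance by default; prediction = majority vote
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on held-out data
```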
11. How does the decision tree algorithm work?
Ans:
It splits data into branches based on feature values that result in the highest information gain or lowest Gini impurity. This continues recursively until terminal nodes (leaves) are reached.
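A short sketch with scikit-learn (max_depth=3 is an arbitrary cap for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" (the default) minimizes Gini impurity at each split;
# criterion="entropy" would maximize information gain instead
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree))  # the learned splits, down to the leaves
```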
12. Explain Support Vector Machines (SVM) and their applications.
Ans:
SVM finds the optimal hyperplane that best separates classes in the feature space. It's useful for text classification, face detection, and bioinformatics due to its effectiveness in high-dimensional spaces.
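A minimal sketch, assuming scikit-learn; features are scaled first because SVMs are sensitive to feature magnitudes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The RBF kernel lets the model separate classes that are not linearly separable
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only
```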
13. Describe the working of the Naive Bayes algorithm.
Ans:
Naive Bayes is a probabilistic classifier based on Bayes' theorem that assumes features are independent of one another. It computes the posterior probability of each class and predicts the class with the highest probability.
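A minimal Gaussian Naive Bayes sketch in scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB assumes each feature is normally distributed within each class
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:1]))  # posterior probability for each class
print(nb.predict(X[:1]))        # class with the highest posterior
```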
14. Explain k-means clustering and its use cases.
Ans:
K-means partitions data into k clusters of similar points by minimizing within-cluster variance. Common use cases include market segmentation, customer segmentation, and image compression.
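A short sketch on synthetic data (the cluster count and random seed are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # the within-cluster variance being minimized
```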
15. What is a neural network? How does it work?
Ans:
A neural network is made up of layers of connected nodes (neurons). Each neuron applies a weighted sum and an activation function to its inputs, and the network learns by adjusting the weights through backpropagation.
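A toy single-neuron forward pass in NumPy (the inputs and weights are made up) to make the weighted-sum-plus-activation idea concrete:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # a common activation function

x = np.array([0.5, -1.2, 3.0])  # inputs to one neuron
w = np.array([0.4, 0.7, -0.2])  # weights (what backpropagation adjusts)
b = 0.1                         # bias term

# Weighted sum of inputs, passed through the activation
output = sigmoid(np.dot(w, x) + b)
print(output)
```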
16. What are ensemble methods in machine learning?
Ans:
Ensemble methods combine several models to improve predictive performance (sketched below):
- Bagging (e.g., Random Forest)
- Boosting (e.g., XGBoost, AdaBoost)
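A minimal comparison sketch in scikit-learn (the models and estimator counts are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees on bootstrap samples, predictions averaged
bagging = RandomForestClassifier(n_estimators=100, random_state=42)

# Boosting: models trained sequentially, each focusing on earlier errors
boosting = AdaBoostClassifier(n_estimators=100, random_state=42)

for model in (bagging, boosting):
    print(cross_val_score(model, X, y, cv=5).mean())
```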
17. How do you handle outliers in your dataset?
Ans:
- Removing them
- Transforming variables (e.g., log scaling)
- Capping values (winsorizing)
- Using robust models (like decision trees)
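A short NumPy sketch of the first three options, using the common 1.5×IQR rule on made-up values:

```python
import numpy as np

data = np.array([12, 14, 15, 13, 110, 16, 14, 12])  # 110 is an outlier

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = data[(data >= lower) & (data <= upper)]  # option 1: remove
capped = np.clip(data, lower, upper)               # option 2: winsorize/cap
logged = np.log(data)                              # option 3: log transform
```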
18. What techniques do you use for feature scaling?
Ans:
- Normalization (Min-Max Scaling): Scales features to [0,1]
- Standardization (Z-score): Centers around mean = 0 and std = 1
- Robust scaling: Uses median and IQR, useful for outliers
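A minimal scikit-learn sketch of all three scalers on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # 100 acts as an outlier

X_minmax = MinMaxScaler().fit_transform(X)  # rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # mean 0, std 1
X_robust = RobustScaler().fit_transform(X)  # median/IQR, outlier-resistant
```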
19. What is one-hot encoding, and when do you use it?
Ans:
It converts categorical variables into binary columns and is used when a machine learning model requires numerical input, e.g., converting "Red", "Blue", "Green" into [1,0,0], [0,1,0], [0,0,1].
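A minimal pandas sketch of the same example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Each category becomes its own binary column
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```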
20. What is feature selection, and why is it important?
Ans:
Feature selection identifies the features most relevant to model training. It reduces overfitting, shortens training time, and improves model performance.