1. What is a confusion matrix and why is it useful in classification tasks?
Ans:
A confusion matrix is a summary table that shows how well a classification model’s predictions match the actual labels. It breaks down results into true positives, true negatives, false positives and false negatives. From this table, important metrics such as precision, recall, accuracy and F1‑score can be calculated. Using a confusion matrix gives clearer insight into where the model is performing well and where it is making mistakes, beyond just overall accuracy.
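As a small illustration, here is a minimal sketch using scikit-learn; the `y_true` and `y_pred` arrays are made-up binary labels just for demonstration:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

# The same table underlies the standard metrics
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```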
2. How would missing or corrupted data in a dataset be handled before training a model?
Ans:
Before training, it’s important to clean the data because missing or corrupted entries can distort model learning. One approach is to remove rows or columns that have too many missing values, while another is to fill in missing entries using statistical methods like mean, median or mode imputation. After that, normalization or encoding (for categorical fields) might be needed to make the data ready for algorithms. Proper data cleaning leads to more reliable, accurate models.
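A rough sketch of both options with pandas; the DataFrame and its column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "city":   ["Pune", "Delhi", np.nan, "Pune", "Delhi"],
})

# Option 1: drop rows that are missing too many values
df_dropped = df.dropna(thresh=2)  # keep rows with at least two non-missing fields

# Option 2: impute -- median for numeric columns, mode for the categorical column
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Encode the categorical field so algorithms can use it (scaling could follow)
df = pd.get_dummies(df, columns=["city"])
print(df)
```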
3. What is the bias‑variance tradeoff and why does it matter in machine learning?
Ans:
The bias‑variance tradeoff refers to a balance between error due to erroneous assumptions (bias) and error due to sensitivity to small fluctuations in the training set (variance). A model with high bias may be too simple and underfit, missing important patterns. A model with high variance may overfit, capturing noise rather than general patterns, which hurts performance on new data. Striking the right balance ensures that the model generalizes well to unseen data, rather than just memorizing the training set.
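A minimal sketch of the tradeoff, assuming scikit-learn and a small synthetic dataset: a degree-1 polynomial leans toward high bias (underfitting), a very high degree toward high variance (overfitting), and cross-validated error shows where the balance lies:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic non-linear data: y = sin(x) + noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

# Compare a high-bias model (degree 1) with a high-variance one (degree 15)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree:2d}  mean CV MSE={-scores.mean():.3f}")
```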
4. When is it better to use a simpler algorithm rather than a complex model like a neural network?
Ans:
A simpler algorithm is often better when the dataset is small, the features are well-understood or interpretability is important. Simple models (like linear regression, decision trees or logistic regression) are easier to interpret, faster to train and less prone to overfitting if data is limited. For tasks where relationships are straightforward, using simpler methods avoids unnecessary complexity and often yields stable performance. In contrast, deep models should be reserved for problems needing sophisticated pattern recognition (e.g. images, text).
5. What is cross‑validation and how does it help in evaluating machine learning models?
Ans:
Cross‑validation is a technique used to estimate how a model will perform on unseen data by splitting the dataset into multiple subsets (folds). The model is trained on some folds and validated on the remaining ones, and this process repeats until every fold has been used for validation. This helps check how stable and generalizable the model is, rather than relying on a single train/test split. It reduces the risk of an overly optimistic estimate from one lucky split and gives a more robust evaluation of model performance.
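A minimal sketch with scikit-learn's cross-validation helper; the iris dataset and the logistic regression model are chosen here just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on 4 folds, validate on the 5th, so every fold is used once for validation
scores = cross_val_score(model, X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy    :", scores.mean())
```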
6. What is feature engineering and why is it important in machine learning workflows?
Ans:
Feature engineering involves creating new input variables or transforming existing ones to better represent the information needed by the model. This might include scaling/normalizing data, turning categorical variables into numeric form, generating interaction features or extracting meaningful attributes from raw data. Well-engineered features often improve model accuracy significantly. Even the best models will struggle if the input features don’t properly represent the underlying patterns.
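A small sketch of these transformations (a hand-crafted interaction feature, scaling numeric fields and one-hot encoding a categorical field) using scikit-learn's ColumnTransformer; the column names are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical raw features
df = pd.DataFrame({
    "sqft":     [850, 1200, 640, 2100],
    "bedrooms": [2, 3, 1, 4],
    "city":     ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Example of a hand-crafted interaction feature
df["sqft_per_bedroom"] = df["sqft"] / df["bedrooms"]

# Scale numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["sqft", "bedrooms", "sqft_per_bedroom"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
X = preprocess.fit_transform(df)
print(X.shape)  # engineered feature matrix ready for a model
```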
7. What is overfitting and what strategies help to prevent it?
Ans:
Overfitting happens when a model learns the details and noise in the training data too well, resulting in poor generalization to new data. To avoid overfitting, techniques such as regularization (adding a penalty for complexity), limiting model complexity, using cross‑validation, adding more training data or applying dropout (for neural networks) can be used. These approaches help create models that perform well not only on training data but also on unseen real-world data.
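As one concrete sketch (scikit-learn, synthetic data), limiting a decision tree's depth is a simple way to constrain complexity; the gap between training and test accuracy shows the effect:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data; a depth-limited one generalizes better
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```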
8. When would you choose a tree‑based model (like decision tree or random forest) over linear regression for a problem?
Ans:
Tree‑based models are useful when relationships between features and the target are non-linear or when there are complex interactions among features. Unlike linear regression, which assumes a linear relationship, decision trees and ensembles (like random forests) can automatically capture non-linear patterns and interactions. They also handle mixed data types and, in some implementations, missing values more robustly. Such models are often preferred when data is messy or pattern complexity is high.
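A minimal sketch of the difference, assuming scikit-learn and a synthetic target that depends on the features non-linearly and through an interaction; a random forest can pick up the pattern that a linear model misses:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a non-linear term and a feature interaction
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.2f}")
```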
9. What is regularization and how does it help in building better models?
Ans:
Regularization is a method that adds a penalty on model complexity during training, discouraging overly complex models that might overfit. By constraining coefficients (in methods like L1 or L2 regularization), it reduces variance while slightly increasing bias, which often results in better performance on unseen data. Regularization helps balance model flexibility and generalization, leading to more robust outcomes across different datasets.
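A small sketch with scikit-learn on a synthetic regression problem: L2 (Ridge) shrinks coefficients compared with plain least squares, while L1 (Lasso) drives some of them exactly to zero; the alpha values here are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic problem where only a few of the 20 features truly matter
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for name, model in [("OLS  ", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),   # L2 penalty shrinks coefficients
                    ("Lasso", Lasso(alpha=1.0))]:   # L1 penalty zeroes some out
    model.fit(X, y)
    coefs = model.coef_
    print(f"{name}: max |coef|={np.abs(coefs).max():.1f}, "
          f"zero coefs={(coefs == 0).sum()}")
```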
10. How would you decide which machine learning algorithm to use for a given problem?
Ans:
Choosing an algorithm depends on several factors: whether data is labeled, the nature of the problem (classification vs regression vs clustering), size of dataset, computational resources available and whether interpretability is important. For simple, structured data with linear relationships, linear models may suffice. For complex data or non-linear relationships, tree‑based or neural network models may perform better. Correctly analyzing data and problem requirements helps in selecting the right algorithm for reliable performance.