1. What is a confusion matrix and why is it important in evaluating classifiers?
Ans:
A confusion matrix is a table that summarizes how a classification model’s predictions compare to actual outcomes. It separates results into true positives, true negatives, false positives, and false negatives. From these values, you can calculate metrics like accuracy, precision, recall, and F1-score, which provide a detailed view of model performance beyond overall correctness.
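A minimal sketch with scikit-learn, using made-up label vectors purely for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```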
2. How should missing or invalid data be handled before training a model?
Ans:
Before feeding data to a model, missing or corrupted values must be addressed to avoid biased or incorrect learning. Options include removing rows or columns with excessive missing values or filling gaps using statistical imputation methods such as mean, median, or mode. After cleaning, features may need to be scaled or converted to numeric formats to ensure proper processing.
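A short sketch of both options, assuming a small hypothetical pandas DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values, for illustration only.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, np.nan],
    "income": [48000, 52000, np.nan, 61000, 45000],
})

# Option 1: drop rows with any missing value (fine when losses are small).
print(df.dropna())

# Option 2: fill gaps with a column statistic such as the median.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```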
3. What does the bias-variance tradeoff mean and why is it significant?
Ans:
The bias-variance tradeoff describes the balance between underfitting and overfitting. High bias occurs when a model is too simple to capture patterns in the data, leading to underfitting. High variance arises when a model is too sensitive to training data, capturing noise instead of general patterns, resulting in overfitting. Balancing bias and variance ensures the model generalizes well to new, unseen data.
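One way to see the tradeoff is to vary model complexity, here polynomial degree, on synthetic data (the dataset and degrees below are assumptions for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy non-linear target

# Degree 1 underfits (high bias); degree 15 tends to overfit (high variance);
# a moderate degree usually scores best on held-out folds.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree:2d}: mean CV R^2 = {score:.3f}")
```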
4. When is it preferable to use a simpler algorithm instead of a complex model like a neural network?
Ans:
Simpler algorithms are ideal for small datasets, well-understood features, or situations where interpretability is crucial. Models like linear regression, logistic regression, or basic decision trees are easier to train, faster to run, and less prone to overfitting. Complex models, such as deep neural networks, are better suited for tasks involving large datasets or complicated patterns, such as images or natural language.
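As a sketch of the interpretability point, a scaled logistic regression exposes per-feature coefficients you can read directly (the dataset choice here is just an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# A small, well-understood tabular dataset: a good fit for a simple model.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Each coefficient has a direct reading: its sign and magnitude show how the
# (standardized) feature pushes the predicted log-odds.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, w in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:5]:
    print(f"{name:25s} {w:+.2f}")
```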
5. What is cross-validation and how does it improve model evaluation?
Ans:
Cross-validation is a method for estimating a model’s ability to generalize by splitting the data into multiple folds. The model is trained on some folds and tested on the remaining ones, repeating the process so each fold is used for validation. This approach provides a more reliable measure of performance and reduces the likelihood of overfitting compared to a single train-test split.
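A minimal 5-fold example with scikit-learn (model and dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold, rotate 5 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("per-fold accuracy:", scores.round(3))
print("mean +/- std:     ", scores.mean().round(3), scores.std().round(3))
```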
6. What is feature engineering and why is it important?
Ans:
Feature engineering involves creating new features or transforming existing ones to make them more informative for the model. This can include normalizing values, converting categories into numerical form, creating interaction terms, or extracting meaningful attributes from raw data. Well-engineered features often improve model accuracy and effectiveness more than tweaking algorithms alone.
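A sketch of three common transformations on a hypothetical housing table (all column names are made up):

```python
import pandas as pd

# Hypothetical raw data for illustration.
df = pd.DataFrame({
    "city":      ["NY", "SF", "NY", "LA"],
    "sqft":      [700, 900, 1200, 850],
    "bedrooms":  [1, 2, 3, 2],
    "sold_date": pd.to_datetime(["2024-01-05", "2024-03-12",
                                 "2024-07-30", "2024-11-02"]),
})

# Convert a categorical column into numeric indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Create an interaction / ratio feature.
df["sqft_per_bedroom"] = df["sqft"] / df["bedrooms"]

# Extract a meaningful attribute from raw data (here, month for seasonality).
df["sale_month"] = df["sold_date"].dt.month
print(df.drop(columns="sold_date"))
```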
7. What is overfitting, and which methods help prevent it?
Ans:
Overfitting occurs when a model captures noise and details specific to the training data, reducing its ability to generalize to new data. Strategies to avoid overfitting include limiting model complexity, applying regularization (e.g., L1 or L2 penalties), using cross-validation, adding more data, or employing dropout in neural networks.
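A small sketch of one of these strategies, an L2 penalty, on synthetic data (the degree, alpha, and seed are assumptions; exact scores will vary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(40, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpenalized high-degree polynomial chases the training noise; the same
# features with an L2 penalty (Ridge) typically generalize better.
for name, reg in [("no penalty", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(12), reg)
    model.fit(X_tr, y_tr)
    print(f"{name:10s} train R^2={model.score(X_tr, y_tr):.3f} "
          f"test R^2={model.score(X_te, y_te):.3f}")
```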
8. When would you select a tree-based model over linear regression?
Ans:
Tree-based models, like decision trees or random forests, are useful when feature-target relationships are non-linear or involve complex interactions that a linear model cannot express. Many tree-based implementations also handle categorical features and missing values more gracefully than linear regression, which assumes an additive, linear relationship between features and the target.
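A sketch comparing the two on a synthetic target with a step-like interaction (the data-generating rule is invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 6, size=(200, 2))
# A step plus an interaction: no single straight line fits this.
y = np.where(X[:, 0] > 3, 10, 0) + X[:, 1] * (X[:, 0] > 3) + rng.normal(size=200)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, f"mean CV R^2 = {score:.3f}")
```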
9. How does regularization help improve model performance?
Ans:
Regularization adds a penalty for model complexity during training, discouraging overly complex models that might overfit. Techniques like L1 (Lasso) and L2 (Ridge) reduce variance while slightly increasing bias, leading to better performance on unseen data. Regularization balances flexibility with generalization.
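A sketch contrasting L1 and L2 on synthetic data where only a few features matter (the alpha values and data shape are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of 30 features actually matter.
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero (implicit feature selection);
# L2 shrinks all coefficients toward zero but rarely zeroes them out.
print("lasso non-zero coefs:", np.sum(lasso.coef_ != 0))
print("ridge non-zero coefs:", np.sum(ridge.coef_ != 0))
```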
10. How do you choose the most suitable ML algorithm for a task?
Ans:
Selecting an algorithm depends on factors such as whether data is labeled, the type of problem (classification, regression, clustering), dataset size, available computational resources, and the need for interpretability. Simple linear models work for straightforward relationships, while tree-based or neural network models excel with complex or large datasets. Understanding data and goals ensures the best algorithm choice.