1. What is a classifier in machine learning, and how does it function?
Ans:
A classifier is an algorithm that assigns input data to one of several predefined categories. It learns from labeled examples, identifying patterns and relationships between inputs and their corresponding outputs. Once trained, it can predict the class of new, unseen instances. For example, a classifier in an email system can distinguish spam messages from legitimate ones by analyzing features learned from past data. This process enables automated, data-driven decision-making.
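The spam example above can be sketched with scikit-learn; the features and labels here are invented for illustration:

```python
# Minimal sketch of a classifier, assuming scikit-learn is installed.
# Toy "email" features are hypothetical: [number_of_links, count_of_all_caps_words].
from sklearn.linear_model import LogisticRegression

X = [[8, 12], [7, 9], [6, 11], [0, 1], [1, 0], [0, 2]]  # training inputs
y = [1, 1, 1, 0, 0, 0]                                  # labels: 1 = spam, 0 = legitimate

clf = LogisticRegression()
clf.fit(X, y)                    # learn patterns from labeled examples
print(clf.predict([[9, 10]]))    # classify a new, unseen message
```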
2. How do bagging and boosting differ in ensemble learning?
Ans:
Bagging and boosting are both ensemble approaches but operate differently:
Bagging (Bootstrap Aggregating):
- Trains multiple independent models on different random subsets of the data
- Combines predictions using averaging or voting
- Reduces variance and stabilizes results
Boosting:
- Builds models sequentially, each learning from the mistakes of its predecessor
- Focuses on difficult cases to reduce bias
- Often improves accuracy but risks overfitting, especially on noisy data, if left unchecked
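Both styles are available in scikit-learn; in this sketch the dataset and hyperparameters are illustrative:

```python
# Hedged sketch contrasting bagging and boosting, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # synthetic toy data

# Bagging: 50 independent trees on bootstrap samples, combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Boosting: weak learners built sequentially, each reweighting the previous errors
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))  # training accuracy of each ensemble
```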
3. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning uses datasets with labeled outputs, allowing the model to learn the relationship between input features and known results. This makes it suitable for prediction and classification tasks. Unsupervised learning, in contrast, works with unlabeled data and seeks to discover inherent patterns, such as grouping similar items together or reducing the number of dimensions for easier analysis. The choice depends on whether labeled data is available and whether the goal is prediction or pattern discovery.
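The contrast can be sketched on the same dataset, used once with labels and once without (the model choices here are illustrative):

```python
# Sketch: supervised vs unsupervised use of the iris data, assuming scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn the mapping from features to the known labels y
clf = KNeighborsClassifier().fit(X, y)

# Unsupervised: discover three groupings without ever seeing y
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.score(X, y), sorted(set(km.labels_)))
```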
4. Can you explain the bias-variance tradeoff in machine learning?
Ans:
The bias-variance tradeoff is about balancing two types of errors in a model:
- High bias: The model is too simple → underfits → fails to capture the underlying patterns
- High variance: The model is too complex → overfits → learns noise instead of general trends
- Goal: Find a middle ground where the model is complex enough to capture true patterns but generalizes well to unseen data
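One way to see the tradeoff is to fit polynomials of increasing degree to noisy data. This sketch assumes scikit-learn; the synthetic signal, noise level, and degrees are invented for illustration:

```python
# Sketch of bias vs variance via polynomial degree (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)  # true signal + noise

for degree in (1, 4, 15):  # too simple / balanced / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()  # generalization estimate
    print(degree, round(score, 3))
```

Typically the lowest degree underfits (high bias), the middle degree generalizes best, and the highest degree's score degrades as the model starts fitting noise (high variance).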
5. How does a Support Vector Machine (SVM) work, and when is it useful?
Ans:
A Support Vector Machine finds the optimal hyperplane that separates data points from different classes with the maximum margin. For data that is not linearly separable, SVM applies kernel functions to map the data into a higher-dimensional space where a separating hyperplane can be found. It is particularly effective for classification in high-dimensional feature spaces and on relatively small datasets, especially when classes are distinct but not perfectly linearly separable.
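A sketch on the classic two-moons toy dataset (parameters illustrative) shows why the kernel matters:

```python
# Sketch: linear vs RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # a straight line cannot separate the moons
rbf = SVC(kernel="rbf").fit(X, y)        # kernel implicitly maps to higher dimensions

print(linear.score(X, y), rbf.score(X, y))
```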
6. What is overfitting in machine learning, and how can it be avoided?
Ans:
Overfitting occurs when a model captures not just the underlying patterns but also the noise in the training data, leading to poor performance on new data. Common ways to prevent overfitting include:
- Simplifying the model architecture
- Applying regularization (L1 or L2)
- Using cross-validation
- Gathering more training data
- Early stopping during model training
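Two of these remedies, L2 regularization and cross-validation, can be sketched on a deliberately overparameterized synthetic problem (all numbers are illustrative):

```python
# Sketch: few samples, many features -> plain least squares overfits; ridge (L2) helps.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30))          # 40 samples, 30 features: high overfit risk
y = X[:, 0] + rng.normal(0, 0.5, 40)   # only the first feature truly matters

plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()
l2 = cross_val_score(Ridge(alpha=10.0), X, y, cv=5).mean()
print(round(plain, 3), round(l2, 3))   # regularized model should generalize better
```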
7. Which programming languages or libraries are most commonly used for machine learning, and why?
Ans:
Python is the most widely used language due to its readability and extensive ecosystem for data science. Libraries like Pandas and NumPy support data manipulation and numerical operations. scikit-learn provides easy access to classical ML algorithms such as regression, classification, and clustering. TensorFlow and PyTorch are popular for deep learning and neural networks. This combination allows for efficient data preprocessing, model building, evaluation, and deployment.
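A minimal sketch of how these pieces fit together (the study-hours data is invented):

```python
# Sketch of the typical stack: pandas for tables, NumPy for arrays, scikit-learn for models.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "score": [52, 55, 61, 64, 70]})  # hypothetical data

X = df[["hours"]].to_numpy()                    # pandas column -> NumPy feature matrix
model = LinearRegression().fit(X, df["score"])
print(model.predict(np.array([[6.0]])))         # predict the score after 6 hours
```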
8. What is the role of a confusion matrix in classification model evaluation?
Ans:
A confusion matrix helps assess how well a classification model performs by comparing predicted labels with actual labels. It includes:
- True Positives (TP): Correctly predicted positives
- True Negatives (TN): Correctly predicted negatives
- False Positives (FP): Incorrectly predicted positives
- False Negatives (FN): Incorrectly predicted negatives
From these four counts, metrics such as accuracy, precision, recall, and F1-score are derived.
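These four counts can be read directly off scikit-learn's `confusion_matrix` (the labels below are hypothetical):

```python
# Sketch: extracting TP/TN/FP/FN from a confusion matrix, assuming scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# scikit-learn lays the 2x2 matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # -> 3 3 1 1
```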
9. How would you manage missing or corrupted data before training a model?
Ans:
Handling missing or corrupted data involves cleaning and preparing the dataset for modeling. Strategies include removing rows or columns with excessive missing values, imputing missing entries using mean, median, or mode, or applying advanced techniques like K-Nearest Neighbors (KNN) imputation. Additionally, normalization, scaling, and encoding categorical features may be necessary. Thorough preprocessing ensures the model receives consistent and meaningful inputs for accurate learning.
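The listed strategies, dropping rows, simple imputation, and KNN imputation, can be sketched as follows (the tiny DataFrame is invented):

```python
# Sketch of three ways to handle missing values, assuming pandas and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [50, 48, np.nan, 61]})  # hypothetical data with gaps

dropped = df.dropna()                                           # remove incomplete rows
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)  # fill with column means
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)        # fill from similar rows

print(mean_filled[1, 0])  # the mean of the observed ages 25, 31, 40
```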
10. What factors guide your selection of a machine learning algorithm for a problem?
Ans:
Choosing an algorithm depends on several factors:
- Data type: Labeled vs unlabeled
- Problem type: Classification, regression, clustering, etc.
- Dataset size and dimensionality
- Computational resources available
- Need for interpretability vs accuracy
- Nature of relationships in data: Linear or non-linear
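As a rough illustration only, the checklist can be turned into a first-pass heuristic; the function and its suggestions are hypothetical and no substitute for experimentation and validation:

```python
# Oversimplified sketch: mapping problem characteristics to candidate algorithm families.
def suggest_algorithm(labeled: bool, task: str, linear: bool) -> str:
    """Return a candidate family for a first experiment (illustrative only)."""
    if not labeled:
        return "clustering (e.g. k-means) or dimensionality reduction (e.g. PCA)"
    if task == "regression":
        return "linear regression" if linear else "gradient-boosted trees"
    if task == "classification":
        return "logistic regression" if linear else "random forest or kernel SVM"
    return "start simple and compare candidates with cross-validation"

print(suggest_algorithm(labeled=True, task="classification", linear=False))
```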