1. What is a machine learning classifier, and how does it function?
Ans:
A classifier is a model that categorizes input data into specific groups based on patterns learned from labeled training datasets. It predicts the class of new, unseen data by applying these learned patterns. For instance, an email filter can identify spam messages by analyzing previous examples and establishing decision rules.
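A minimal sketch of this idea with scikit-learn, using made-up numeric features (link count, exclamation marks) as a stand-in for real email features:

```python
# Minimal classifier sketch: learn a decision rule from labeled examples,
# then apply it to unseen inputs. Data is synthetic, not real spam.
from sklearn.linear_model import LogisticRegression

# Features: [num_links, num_exclamation_marks]; label 1 = spam, 0 = not spam
X_train = [[8, 5], [7, 6], [9, 4], [0, 1], [1, 0], [0, 0]]
y_train = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)
preds = clf.predict([[9, 6], [0, 0]])  # learned rule applied to new data
print(preds)
```

Any classifier (decision tree, SVM, etc.) follows the same fit/predict pattern.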
2. How are bagging and boosting different in ensemble learning?
Ans:
Bagging, short for bootstrap aggregating, generates multiple independent models using random subsets of the data and combines their predictions to reduce variance and enhance stability. Boosting, on the other hand, builds models sequentially, with each new model focusing on correcting mistakes from prior models, which reduces bias and often improves accuracy.
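The contrast can be sketched with scikit-learn on a synthetic dataset: bagging trains deep trees independently on bootstrap samples, while boosting chains shallow trees that reweight earlier errors.

```python
# Bagging (parallel, variance reduction) vs boosting (sequential, bias
# reduction), both over decision trees, on synthetic classification data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many full-depth trees, each fit on an independent bootstrap sample.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0)

# Boosting: shallow "stumps" trained in sequence, each focusing on the
# examples its predecessors misclassified.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, model.fit(X, y).score(X, y))
```

Random forests are the best-known bagging variant; AdaBoost and gradient boosting (e.g. XGBoost) are the classic boosting families.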
3. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning relies on labeled data to train models to predict outputs from given inputs. Unsupervised learning works with unlabeled data, aiming to identify hidden structures, clusters, or patterns without predefined outcomes. The choice depends on whether the task is predictive or exploratory.
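The same data can be handled both ways, which makes the difference concrete: a supervised model is trained against the labels, while a clustering algorithm sees only the inputs (data here is synthetic blobs).

```python
# Same inputs X, two framings: supervised prediction uses the labels y;
# unsupervised clustering must discover the structure from X alone.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Supervised: labels guide training toward a known target.
knn = KNeighborsClassifier().fit(X, y)

# Unsupervised: only X is passed; groupings are inferred, not taught.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("supervised accuracy:", knn.score(X, y))
print("discovered cluster ids:", set(km.labels_))
```

Note that K-Means recovers the groups but not their names: its cluster ids 0/1 need not match the original labels.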
4. Can you explain the bias-variance tradeoff in modeling?
Ans:
High bias occurs when a model is too simple and underfits, missing important patterns in the data. High variance happens when a model is too complex, capturing noise and performing poorly on new data. The goal is to find a balance, creating a model that generalizes well while accurately capturing underlying patterns.
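The tradeoff can be illustrated by fitting polynomials of increasing degree to noisy samples of a known curve: a straight line underfits, while a very high degree chases the noise (synthetic data, for illustration only).

```python
# Underfitting vs overfitting: fit polynomials of several degrees to noisy
# samples of sin(2*pi*x), then measure error against the true curve.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy samples

x_test = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x_test)  # the pattern we hope to recover

errs = {}
for degree in (1, 4, 15):
    coefs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    errs[degree] = np.mean((np.polyval(coefs, x_test) - y_true) ** 2)
    print(f"degree {degree}: test MSE {errs[degree]:.3f}")
```

Degree 1 has high bias (it cannot bend with the sine wave); the moderate degree tracks the underlying pattern best.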
5. What is a Support Vector Machine (SVM), and when should it be used?
Ans:
SVM is a classification algorithm that identifies the optimal separating boundary (hyperplane) between classes. For non-linear data, kernel functions map data to higher dimensions for better separation. SVM is effective in tasks with clear or complex decision boundaries and works well for small to medium-sized datasets.
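The effect of the kernel is easy to demonstrate on concentric circles, a classic dataset no straight hyperplane can separate (synthetic data via scikit-learn):

```python
# Linear vs RBF-kernel SVM on data that is not linearly separable:
# the kernel implicitly maps points to a space where a hyperplane works.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # straight boundary: struggles
rbf = SVC(kernel="rbf").fit(X, y)        # kernel trick: clean separation

print("linear accuracy:", linear.score(X, y))
print("rbf accuracy:", rbf.score(X, y))
```

The linear SVM hovers near chance on this data, while the RBF kernel separates the rings almost perfectly.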
6. What does overfitting mean, and how can it be avoided?
Ans:
Overfitting happens when a model memorizes training data, including noise, and performs poorly on new data. It can be mitigated by simplifying the model, applying regularization (L1/L2), using cross-validation, collecting more training data, or stopping training early once performance on validation data plateaus.
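Two of these remedies can be shown together on synthetic data: cross-validation exposes the gap between training fit and generalization, and L2 (ridge) regularization shrinks an over-flexible model.

```python
# An over-flexible polynomial model: near-perfect training fit, but
# cross-validation reveals poor generalization; L2 regularization helps.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=30)

plain = make_pipeline(PolynomialFeatures(12), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1.0))

train_r2 = plain.fit(X, y).score(X, y)  # fit measured on the training data
print("train R^2 (no reg):", train_r2)
print("CV R^2 (no reg):", cross_val_score(plain, X, y, cv=5).mean())
print("CV R^2 (ridge):", cross_val_score(ridge, X, y, cv=5).mean())
```

The training score looks excellent while the cross-validated score drops, which is the signature of overfitting that these techniques are designed to catch and control.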
7. Which programming languages or libraries are commonly used for AI/ML, and why?
Ans:
Python is widely favored due to its readability and extensive ecosystem. Libraries such as Pandas and NumPy handle data operations, scikit-learn offers classical ML algorithms, and TensorFlow or PyTorch support deep learning. Together, they simplify preprocessing, model building, evaluation, and deployment.
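A few lines show how these libraries interlock in practice (toy data, chosen so the relationship is roughly y = 2x):

```python
# The typical Python ML stack in one pass: pandas holds the data,
# NumPy arrays sit underneath, scikit-learn fits the model.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.1, 3.9, 6.2, 8.1]})

X = df[["x"]].to_numpy()                 # pandas DataFrame -> NumPy array
model = LinearRegression().fit(X, df["y"])
print(round(float(model.coef_[0]), 1))   # → 2.0 (recovered slope)
```

TensorFlow and PyTorch slot into the same workflow when the model is a neural network rather than a classical estimator.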
8. What is a confusion matrix, and why is it important?
Ans:
A confusion matrix compares predicted labels against actual labels in classification tasks. It records true positives, true negatives, false positives, and false negatives, allowing calculation of metrics like accuracy, precision, recall, and F1-score. This helps evaluate both model performance and the types of errors it makes.
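A small worked example makes the four cells and the derived metrics concrete:

```python
# Confusion matrix from predicted vs actual labels, plus precision and
# recall derived directly from its cells.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()            # [[TN, FP], [FN, TP]] for binary labels
print(cm)                              # → [[3 1]
                                       #    [1 3]]
print("precision:", tp / (tp + fp))    # → 0.75
print("recall:", tp / (tp + fn))       # → 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here one spam message slipped through (false negative) and one legitimate message was flagged (false positive), and the matrix makes both error types visible in a way a single accuracy number cannot.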
9. How would you manage missing or corrupted data in a dataset?
Ans:
Missing or corrupted data can be handled by removing affected rows/columns, imputing values with the mean, median, or mode, or using model-based techniques such as KNN imputation. Beyond that, broader preprocessing steps like scaling, normalization, and encoding of categorical features help produce a clean, consistent dataset for modeling.
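The three main strategies side by side, on a tiny illustrative frame:

```python
# Three ways to handle missing values: drop the rows, fill with a simple
# statistic (median), or impute from similar rows (KNN). Toy data.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age":    [25.0, np.nan, 40.0, 35.0],
                   "income": [50.0, 60.0, np.nan, 80.0]})

dropped = df.dropna()  # option 1: discard incomplete rows

median_filled = pd.DataFrame(  # option 2: fill with the column median
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

knn_filled = pd.DataFrame(     # option 3: borrow from the nearest rows
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(median_filled)
```

Dropping rows is safe only when little data is lost; imputation preserves sample size at the cost of some injected assumption about the missing values.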
10. What factors influence the choice of a machine learning algorithm?
Ans:
Algorithm selection depends on the type of data (labeled/unlabeled), the problem (classification, regression, clustering), dataset size, computational resources, and model interpretability. Simple algorithms like decision trees work well for small structured datasets, while deep learning models are suitable for complex data such as images or text.