1. Can you explain what One-Hot Encoding means?
Ans:
A method for transforming categorical data into a numerical format that machine learning models can understand is called one-hot encoding. It represents each category as a binary vector with one value set to 1 ("hot") and the others to 0 ("cold"). For instance, [1,0,0], [0,1,0] and [0,0,1] would be the encoded values for a "color" variable with categories red, blue and green. This lets algorithms interpret non-numerical data efficiently.
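A minimal sketch of the "color" example using pandas (the data here is illustrative):

```python
import pandas as pd

# Toy "color" column, as in the example above
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# get_dummies creates one binary column per category;
# each row has exactly one 1 ("hot") and 0s elsewhere ("cold")
encoded = pd.get_dummies(df["color"], dtype=int)
print(encoded)
```

Libraries such as scikit-learn offer an equivalent `OneHotEncoder` that integrates with model pipelines.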
2. How does Lemmatization differ from Stemming?
Ans:
Although lemmatization and stemming both reduce words to their base forms, they differ in method and precision. Lemmatization uses vocabulary and linguistic rules to find a word's proper dictionary form, ensuring grammatical correctness. Stemming, however, strips prefixes or suffixes without context, which can create non-existent words. For instance, lemmatization maps "better" to "good", while stemming may reduce "studies" to the non-word "studi".
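The contrast can be illustrated with a toy rule-based "stemmer" and a tiny dictionary "lemmatizer"; real projects would use a library such as NLTK or spaCy, and the word list below is a hand-made example:

```python
def toy_stem(word):
    # Crude suffix stripping with no linguistic context
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A lemmatizer relies on vocabulary lookups (tiny illustrative table)
LEMMA_LOOKUP = {"better": "good", "studies": "study", "running": "run"}

def toy_lemmatize(word):
    return LEMMA_LOOKUP.get(word, word)

print(toy_stem("studies"))       # "stud" - a non-word
print(toy_lemmatize("studies"))  # "study" - a valid dictionary form
```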
3. What does Conditional Probability mean in simple terms?
Ans:
The probability that one event will occur given that another event has already occurred is known as conditional probability. It is mathematically expressed as P(A|B) = P(A and B) / P(B). This concept is crucial in data science and machine learning, as it helps models predict outcomes based on dependent conditions, such as determining the probability of rain given the presence of clouds.
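The rain-given-clouds example can be estimated directly from counts; the observations below are made up for illustration:

```python
# P(rain | clouds) estimated from joint observation counts,
# following P(A|B) = P(A and B) / P(B)
observations = [
    {"clouds": True,  "rain": True},
    {"clouds": True,  "rain": False},
    {"clouds": True,  "rain": True},
    {"clouds": False, "rain": False},
]

n = len(observations)
p_b = sum(o["clouds"] for o in observations) / n                     # P(B) = 3/4
p_ab = sum(o["clouds"] and o["rain"] for o in observations) / n      # P(A and B) = 2/4

p_rain_given_clouds = p_ab / p_b
print(p_rain_given_clouds)  # 2/3
```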
4. What is meant by overfitting in machine learning models?
Ans:
Overfitting occurs when a model learns noise and random fluctuations in addition to the underlying patterns, becoming overly tailored to its training data. This leads to poor performance on new or unseen data. To combat overfitting, techniques like cross-validation, dropout, pruning and regularization are applied to ensure that the model generalizes well to real-world situations.
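A small NumPy sketch makes the effect visible: a high-degree polynomial nearly interpolates noisy training points (very low training error) but transfers poorly to held-out points, unlike a simple linear fit. The data and degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, size=x.size)  # linear signal plus noise

x_train, y_train = x[::2], y[::2]   # even-indexed points for training
x_test, y_test = x[1::2], y[1::2]   # held-out points

def errors(deg):
    coefs = np.polyfit(x_train, y_train, deg)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

simple_train, simple_test = errors(1)    # underlying linear trend
overfit_train, overfit_test = errors(9)  # memorizes the noise
print(simple_train, simple_test)
print(overfit_train, overfit_test)
```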
5. How can missing data in a dataset be handled effectively?
Ans:
Dealing with missing data involves several strategies depending on the type and amount of missing values. Common methods include imputing missing entries with the mean, median or mode, or using predictive models to estimate them. If the missing portion is minimal, rows or columns can be removed altogether. Choosing the right approach helps maintain the integrity of the dataset and prevents bias in analysis.
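A quick sketch of mean and mode imputation with pandas, on a made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "city": ["NY", "LA", None, "NY", "LA"]})

# Numeric column: impute with the mean; categorical: impute with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```

For the predictive-model approach, scikit-learn's `IterativeImputer` or `KNNImputer` are common choices.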
6. What is the trade-off between Precision and Recall?
Ans:
Two important criteria for evaluating the effectiveness of classification models are precision and recall. Precision measures how many predicted positive results are actually correct, while recall measures how well the model identifies all actual positives. Improving one often decreases the other, so finding the right balance depends on the problem: for example, recall is prioritized in medical diagnosis, while precision matters more in spam detection.
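Both metrics follow directly from confusion-matrix counts; the labels below are illustrative:

```python
# Precision and recall computed from true/false positive counts
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # correct positives among predicted positives
recall = tp / (tp + fn)     # positives found among all actual positives
print(precision, recall)    # 0.75 0.75
```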
7. How is XGBoost different from Random Forest?
Ans:
XGBoost and Random Forest are both ensemble learning algorithms, but they differ in how they combine decision trees. Random Forest builds multiple trees independently and averages their results, reducing variance and avoiding overfitting. XGBoost builds trees sequentially, where each new tree corrects errors from previous ones using gradient boosting. This makes XGBoost faster and often more accurate, especially for structured data tasks.
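Since the XGBoost package may not be available everywhere, the bagging-versus-boosting contrast can be sketched with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost's sequential approach; the dataset and parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: independent trees whose votes are averaged
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: trees built sequentially, each correcting the previous ones
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te), gb.score(X_te, y_te))
```

XGBoost's own `XGBClassifier` follows the same fit/score interface, adding regularization and performance optimizations on top of this boosting idea.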
8. Can you describe a project that involved a machine learning model implementation?
Ans:
A machine learning model was implemented to build a collaborative filtering recommendation system for an e-commerce site. The system analyzed user interactions to recommend personalized products. Matrix factorization techniques were used to improve prediction accuracy, and the model was evaluated with precision and recall metrics to ensure trustworthy recommendations.
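The core matrix-factorization step can be sketched with a truncated SVD in NumPy; the rating matrix and factor count below are hypothetical, not the actual project's data:

```python
import numpy as np

# Hypothetical user-item rating matrix (0 = unobserved interaction)
R = np.array([[5, 0, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

# Truncated SVD as a simple matrix-factorization recommender:
# keep k latent factors and reconstruct predicted affinities
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Recommend, for user 0, the unseen item with the highest predicted score
unseen = np.where(R[0] == 0)[0]
best = unseen[np.argmax(R_hat[0, unseen])]
print("recommend item", best, "for user 0")
```

Production systems typically learn the factors by minimizing error only over observed entries (e.g. alternating least squares), rather than treating zeros as true ratings.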
9. What differentiates supervised learning from unsupervised learning?
Ans:
Supervised learning trains models on labeled data, where the input and intended output are known; predicting sales figures from historical data is a typical example. Unsupervised learning, on the other hand, finds latent structures or groupings in unlabeled data without predetermined outputs. Supervised techniques include linear regression and neural networks, while k-means and PCA are examples of unsupervised techniques.
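The distinction in miniature, with made-up data: a supervised fit learns from known targets, while an unsupervised step groups points with no targets at all (nearest-centroid assignment, the core of one k-means iteration):

```python
import numpy as np

# Supervised: labeled pairs (x, y) - fit a line to known targets y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
slope, intercept = np.polyfit(x, y, 1)

# Unsupervised: unlabeled points assigned to the nearest of two centroids
points = np.array([0.1, 0.2, 0.15, 5.0, 5.2, 4.9])
centroids = np.array([0.0, 5.0])
labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)

print(slope, intercept)  # recovers 2 and 1
print(labels)            # two clusters discovered without any targets
```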
10. How can categorical variables with many unique values be encoded?
Ans:
Encoding high-cardinality categorical variables requires efficient techniques to prevent excessive complexity. One method is target encoding, which replaces each category with the mean of the target variable for that category. Alternatively, one-hot encoding followed by dimensionality reduction methods like PCA can simplify the data representation. The goal is to balance detail retention with computational efficiency while avoiding overfitting.
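A minimal target-encoding sketch with pandas, on a made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA", "NY"],
                   "purchased": [1, 0, 1, 1, 0, 0]})

# Target encoding: replace each category with the mean target value
# observed for that category. In practice the means are computed on
# training folds only, to avoid leaking the target into the features.
means = df.groupby("city")["purchased"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```

Smoothing the per-category means toward the global mean is a common refinement for rare categories.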