1. What is Data Science?
Ans:
Data Science is the practice of examining data to recognize trends, resolve issues and reach well-informed conclusions. It blends computer science, statistics, mathematics and domain knowledge to convert raw information into actionable insights that guide better business strategies.
2. What constitutes data science's essential elements?
Ans:
The main components include collecting data from diverse sources, cleaning it to fix errors or missing values, analyzing it to discover trends, building predictive models using algorithms and interpreting the results for practical decision-making. Together, these steps create effective, data-driven solutions.
3. What is a confusion matrix?
Ans:
A confusion matrix is a table that evaluates the performance of a machine learning model. It compares predicted results with actual outcomes, showing true positives, true negatives, false positives and false negatives. This helps understand the accuracy of predictions and the types of errors made.
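The four counts can be computed directly by comparing labels. A minimal sketch in plain Python, using invented example labels:

```python
# Count the four cells of a binary confusion matrix by comparing
# actual labels with predicted labels (1 = positive, 0 = negative).
def confusion_matrix(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Illustrative labels, not from any real dataset.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

In practice a library routine such as scikit-learn's `confusion_matrix` does the same counting for you.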
4. Which metrics are used to evaluate model performance?
Ans:
Common evaluation metrics include accuracy, which measures the overall correctness of predictions; precision, which shows the proportion of correct positive predictions; recall, which assesses how many actual positives were identified; and F1-score, which balances precision and recall. ROC-AUC is also used to test a model’s ability to distinguish between classes.
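These metrics all derive from the confusion-matrix counts. A short sketch with made-up counts for illustration:

```python
# Illustrative confusion-matrix counts (not from a real model).
tp, tn, fp, fn = 40, 45, 5, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
precision = tp / (tp + fp)                    # how many predicted positives were right
recall    = tp / (tp + fn)                    # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Note the trade-off: here precision (≈0.89) exceeds recall (0.80), and F1 (≈0.84) sits between them.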
5. What is feature engineering?
Ans:
Feature engineering is the process of creating or modifying input variables to improve a model's predictive performance. It involves selecting relevant features, transforming data or combining variables. Proper feature engineering can significantly enhance model accuracy and efficiency.
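As a small sketch of combining and transforming variables, assume some invented housing records with `price`, `area_sqft` and `built` fields:

```python
# Derive new features from raw fields: a combined ratio and a
# transformed age. Records and field names are invented for illustration.
raw = [
    {"price": 200_000, "area_sqft": 1000, "built": 1990},
    {"price": 350_000, "area_sqft": 1400, "built": 2015},
]

def engineer(row, current_year=2024):
    return {
        **row,
        "price_per_sqft": row["price"] / row["area_sqft"],  # combined feature
        "age": current_year - row["built"],                 # transformed feature
    }

features = [engineer(r) for r in raw]
print(features[0]["price_per_sqft"], features[0]["age"])  # 200.0 34
```

A model often learns more easily from a ratio like price per square foot than from the two raw columns separately.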
6. How is missing data handled?
Ans:
Missing data can be addressed by removing rows or columns with excessive gaps, filling missing values using mean, median or mode, employing algorithms that handle missing data automatically or predicting missing entries based on other available information. The goal is to maintain dataset integrity for reliable analysis.
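One of the simplest strategies mentioned above, mean imputation, can be sketched in plain Python (values are illustrative):

```python
# Fill missing entries (None) with the mean of the observed values.
from statistics import mean

values = [4.0, None, 6.0, None, 8.0]
observed = [v for v in values if v is not None]
fill = mean(observed)                            # mean of 4, 6, 8 -> 6.0
imputed = [fill if v is None else v for v in values]
print(imputed)  # [4.0, 6.0, 6.0, 6.0, 8.0]
```

Median or mode imputation works the same way (`statistics.median`, `statistics.mode`); the median is preferred when the column contains outliers.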
7. What is overfitting and how can it be prevented?
Ans:
Overfitting occurs when a model learns not only the patterns but also the noise in training data, which reduces its performance on new data. Preventive measures include using simpler models, applying cross-validation, adding regularization techniques or increasing the size of the training dataset to improve generalization.
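Of the preventive measures listed, cross-validation is the most mechanical: split the data into k folds and hold each fold out once. A minimal index-splitting sketch (model training itself is omitted):

```python
# Produce k-fold train/test index splits; each sample appears in
# exactly one test fold.
def k_fold_indices(n_samples, k):
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j not in set(test)]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, 5)
for train_idx, test_idx in folds:
    print(test_idx)  # [0, 1], [2, 3], ..., [8, 9]
```

If a model scores well on training data but poorly across the held-out folds, that gap is the signature of overfitting.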
8. What is a random forest and how does it function?
Ans:
Random forest is a machine learning technique that combines multiple decision trees to make predictions. A random subset of data is used to train each tree and the ultimate prediction is produced by aggregating outputs from all trees. This method improves accuracy and is less prone to overfitting than a single decision tree.
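A toy version of that idea, shrunk to decision stumps on one feature so it fits in a few lines: each "tree" trains on a bootstrap sample and the forest takes a majority vote. A real forest uses full decision trees and random feature subsets (e.g. scikit-learn's `RandomForestClassifier`); the data below is invented.

```python
# A toy bagged ensemble: decision stumps trained on bootstrap samples,
# combined by majority vote.
import random
from collections import Counter

def train_stump(rows):
    # rows: list of (x, label); pick the threshold with fewest errors
    # for the rule "predict 1 when x >= threshold".
    best = None
    for threshold, _ in rows:
        errors = sum(1 for x, y in rows if (1 if x >= threshold else 0) != y)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best[0]

def forest_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]   # bootstrap subset
        threshold = train_stump(sample)
        votes.append(1 if x >= threshold else 0)
    return Counter(votes).most_common(1)[0][0]      # majority vote

data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
print(forest_predict(data, 7.5))
print(forest_predict(data, 1.5))
```

Because each stump sees a different resample, individual trees disagree near the class boundary, but the vote averages out their mistakes.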
9. What are the steps in a Data Science workflow?
Ans:
A typical workflow starts with defining the problem, collecting relevant data and cleaning it for accuracy. Next, data is explored and analyzed to extract insights, followed by model building, training and evaluation. Finally, models are deployed and continuously monitored to ensure consistent performance.
10. How is data quality ensured?
Ans:
Data quality is maintained by removing duplicates, correcting errors, filling missing values, standardizing formats and validating data sources. Reliable, high-quality data is necessary for precise analysis and forms the foundation of successful Data Science projects.
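Two of those steps, removing duplicates and standardizing formats, can be sketched together in plain Python (the records are invented for illustration):

```python
# Standardize a text field, then drop exact duplicate records.
raw = [
    {"id": 1, "city": " new york "},
    {"id": 2, "city": "NEW YORK"},
    {"id": 1, "city": " new york "},   # duplicate of the first record
]

def standardize(row):
    # Trim whitespace and normalize capitalization.
    return {**row, "city": row["city"].strip().title()}

seen, clean = set(), []
for row in map(standardize, raw):
    key = tuple(sorted(row.items()))   # hashable fingerprint of the record
    if key not in seen:                # keep only the first occurrence
        seen.add(key)
        clean.append(row)
print(clean)  # two records remain, both with city "New York"
```

Note the order matters: standardizing first lets near-duplicates that differ only in formatting collapse into exact duplicates before deduplication.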