1. What distinguishes data science from conventional data analysis?
Ans:
Data science is the process of drawing insightful conclusions from massive and complex datasets using advanced tools like Python, machine learning and statistics. Unlike traditional data analysis that mostly focuses on past trends and simple summaries, data science goes further by predicting future events and solving complex problems with algorithms and programming.
2. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning works with labeled data, where both inputs and correct outputs are known, similar to learning under a teacher's guidance. In contrast, unsupervised learning uses unlabeled data to uncover hidden patterns or group similar items, like finding friend groups based on shared interests without prior labels.
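A minimal sketch of the contrast using scikit-learn (assuming it is installed): a classifier is fit on features together with known labels, while k-means clustering groups the same rows using only the features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model sees both inputs X and the known labels y.
clf = LogisticRegression(max_iter=200).fit(X, y)
print(clf.predict(X[:5]))       # predicted classes

# Unsupervised: the model sees only X and groups similar rows itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])           # discovered cluster ids (no labels used)
```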
3. What is overfitting in machine learning and how can you prevent it?
Ans:
A model is said to overfit when it learns the training data too well, including its noise and errors, which leads to poor results on new data. To avoid this, techniques like cross-validation, simplifying the model, or applying regularization are used to help the model generalize better.
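As one illustration (not the only approach), regularization combined with cross-validation in scikit-learn might look like this; the synthetic data and the alpha value are just assumed examples.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Many features relative to samples makes plain regression prone to overfitting.
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

# Plain linear regression can chase noise in the training data.
plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()

# Ridge adds an L2 penalty that shrinks weights and curbs overfitting.
regularized = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()

print(f"plain R^2: {plain:.3f}  ridge R^2: {regularized:.3f}")
```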
4. Explain the bias-variance tradeoff.
Ans:
Bias refers to errors from incorrect assumptions in the model, causing underfitting, while variance comes from the model being too sensitive to small changes in training data, causing overfitting. A balanced model manages both bias and variance for optimal performance.
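A hedged sketch of the tradeoff: fitting polynomials of increasing degree to noisy data, where a very low degree underfits (high bias) and a very high degree overfits (high variance). The degrees and data here are purely illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)

for degree in (1, 4, 15):   # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree:>2}: cross-validated R^2 = {score:.3f}")
```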
5. In terms of data science, what are the main distinctions between R and Python?
Ans:
Python is a flexible, general-purpose programming language preferred for building data applications and machine learning models, and it is widely used in industry. R is a language built around statistical analysis and data visualization, and it is often preferred in academic and research settings.
6. How should a dataset with missing data be handled?
Ans:
Missing data can be handled by removing incomplete records, filling missing values with averages or the most common values, or using advanced techniques like interpolation or predictive modeling to estimate the missing parts.
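For example, with pandas the three strategies mentioned above could look like the following sketch; the column names are made up for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Delhi", "Pune", None, "Pune"]})

# 1. Remove incomplete records.
dropped = df.dropna()

# 2. Fill with an average (numeric) or the most common value (categorical).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# 3. Interpolate numeric gaps from neighbouring values.
interpolated = df.copy()
interpolated["age"] = interpolated["age"].interpolate()
```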
7. Explain the concept of feature engineering.
Ans:
Feature engineering is the process of creating new input features from existing data to improve model performance. This includes cleaning data, transforming variables and combining features to help the model learn more effectively.
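A small pandas sketch of the idea, with hypothetical column names: deriving new features from a date column and from the ratio of two existing columns.

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-05", "2024-02-20"]),
                   "revenue": [200.0, 450.0],
                   "units": [4, 9]})

# Derive new features from existing columns.
df["order_month"] = df["order_date"].dt.month          # date part
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5  # boolean flag
df["price_per_unit"] = df["revenue"] / df["units"]     # combined feature
```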
8. What distinguishes a regression problem from a classification problem?
Ans:
Regression predicts continuous values, such as house prices or temperatures, while classification predicts categories, like spam or not spam. Both are types of supervised learning but focus on different kinds of predictions.
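In scikit-learn terms, the split often comes down to which estimator you reach for; this pairing is only illustrative.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import make_regression, make_classification

# Regression: the target is a continuous number (e.g. a price).
Xr, yr = make_regression(n_samples=100, n_features=3, random_state=0)
print(LinearRegression().fit(Xr, yr).predict(Xr[:2]))    # real-valued outputs

# Classification: the target is a category (e.g. spam / not spam).
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
print(LogisticRegression().fit(Xc, yc).predict(Xc[:2]))  # class labels 0/1
```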
9. What is a confusion matrix in classification?
Ans:
A confusion matrix compares the actual and predicted results of a classification model. It breaks down outcomes into true positives, false positives, true negatives and false negatives to evaluate model accuracy.
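A minimal scikit-learn sketch with made-up labels shows the four cells.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model's predictions

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```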
10. What are precision and recall?
Ans:
Precision quantifies the proportion of predicted positive cases that are actually correct, while recall measures how many of the actual positive cases were identified correctly. Together they help assess a model's effectiveness in finding relevant results.
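Using the same assumed labels as above, scikit-learn computes both directly; precision is TP / (TP + FP) and recall is TP / (TP + FN).

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision: of everything predicted positive, how much was right? TP/(TP+FP)
print(precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75

# Recall: of all real positives, how many did we find? TP/(TP+FN)
print(recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75
```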