1. What is data science, and how does it differ from data analytics?
Ans:
Data science is a broad field that uses statistics, machine learning and computer programming to extract insights from data and create predictive models. In contrast, data analytics focuses more on examining existing data to identify trends, summarize patterns and solve specific problems using descriptive statistics. While both deal with data, data science leans toward prediction and model building, whereas data analytics is largely descriptive and interpretive.
2. What role does a data scientist perform in a company?
Ans:
An essential function of a data scientist is solving business problems by working with large and complex datasets. They are responsible for cleaning and preparing the data, developing predictive models, analyzing outcomes and communicating their findings to stakeholders. Often, they use data visualization tools to present actionable insights in a clear and impactful way.
3. Describe how structured and unstructured data are different.
Ans:
Structured data follows a specific format and is typically stored in tables with rows and columns, as in relational databases queried with SQL. Unstructured data, on the other hand, does not follow a predefined format and includes data types such as images, videos, PDFs, emails and social media posts. Structured data is easier to analyze, while unstructured data requires more complex processing.
4. What are the key steps of a data science project?
Ans:
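A data science project typically starts with defining the problem, followed by collecting and cleaning the data. Next is exploratory data analysis (EDA) to understand patterns and relationships. After that, a model is built and evaluated for performance. Finally, the model is deployed into production and monitored for accuracy and improvement over time.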
5. How is missing data in a dataset handled?
Ans:
Handling missing data is crucial for maintaining dataset accuracy. Common methods include removing rows or columns that have too many missing values, or imputing missing values using statistical measures such as the mean, median or mode. Advanced techniques like KNN imputation or predictive modeling can also be used for better accuracy.
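As a rough illustration of statistical imputation, here is a minimal sketch in plain Python. The `impute` helper and the `ages` column are hypothetical names made up for this example, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values.

    Hypothetical helper for illustration; `strategy` is "mean", "median" or "mode".
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    # Keep observed values untouched; substitute the fill value for gaps only.
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [25, None, 31, 40, None, 31]
print(impute(ages, "mean"))    # gaps filled with 31.75
print(impute(ages, "median"))  # gaps filled with 31.0
print(impute(ages, "mode"))    # gaps filled with 31
```

The median is usually the safer choice when the column contains outliers, while the mode also works for categorical columns.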
6. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning uses labeled datasets to train algorithms that can predict outcomes or classify data, as in regression or classification tasks. Unsupervised learning, however, works with unlabeled data and seeks to uncover latent groupings or patterns within it, as in clustering or dimensionality reduction.
7. Describe how cross-validation is used in model evaluation.
Ans:
Cross-validation is a technique used to evaluate how effectively a machine learning model works with unseen data. In methods like k-fold cross-validation, the data is divided into k parts (folds); the model is trained on k-1 of them and tested on the remaining one. This process is repeated so that each fold serves as the test set once, which helps detect overfitting and gives a more reliable estimate of the model's performance.
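The k-fold procedure can be sketched in plain Python as follows. All names here (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) are hypothetical, the "model" is a toy that always predicts the mean of its training targets, and the folds are contiguous and unshuffled for readability; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores), scores


def fit_mean(xs, ys):
    """Toy 'model': remember the mean of the training targets."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    """Mean absolute error of always predicting the stored mean."""
    return sum(abs(y - model) for y in ys) / len(ys)


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
average, per_fold = cross_validate(xs, ys, 3, fit_mean, mean_abs_error)
print(per_fold)  # one error value per held-out fold
print(average)   # the cross-validated estimate
```

Reporting the per-fold scores alongside the average shows how much the estimate varies from one split to another.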
8. What is a confusion matrix? Explain its components.
Ans:
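A confusion matrix is a table that assesses the effectiveness of classification models. It includes four components: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). TP and TN count the positive and negative cases the model predicted correctly, FP counts negative cases wrongly predicted as positive, and FN counts positive cases wrongly predicted as negative. These values help calculate metrics like accuracy, precision, recall and F1 score, which provide deeper insight into how well the model is predicting each class.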
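To make the link between the four counts and the derived metrics concrete, here is a small sketch for a binary classifier. The function names and the label lists are made up for illustration, and metrics with a zero denominator are assumed to default to 0.0.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 is the positive class, 0 the negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # (3, 1, 3, 1) as (tp, fp, tn, fn)
print(classification_metrics(*counts))
```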
9. How do you select important features in a dataset?
Ans:
Feature selection helps improve model performance by choosing only the most relevant inputs. Common techniques include filter methods that use statistical tests to score features, wrapper methods like recursive feature elimination that test combinations of features and embedded methods such as Lasso regularization, which automatically selects features during model training.
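A filter method can be sketched with nothing more than a correlation score. The example below ranks features by the absolute Pearson correlation with the target and keeps the top ones. This is only one possible scoring choice, and the helper names and the tiny housing-style dataset are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def top_features(columns, target, n_keep):
    """Score each feature by |correlation with target| and keep the best n_keep."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep], scores


# Made-up data: the target is exactly 3 * size, so "size" should rank first.
columns = {
    "size": [50, 60, 80, 100, 120],
    "age": [30, 5, 20, 10, 25],
    "constant": [1, 1, 1, 1, 1],
}
target = [150, 180, 240, 300, 360]
kept, scores = top_features(columns, target, n_keep=2)
print(kept)    # ['size', 'age']
print(scores)
```

Filter scores like this look at one feature at a time, so they can miss features that are only useful in combination; wrapper and embedded methods address that at a higher computational cost.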
10. Explain the working of the k-nearest neighbors (KNN) algorithm.
Ans:
The KNN algorithm is a basic but effective technique for regression and classification. It operates by finding the 'k' data points closest to a new input, based on a distance metric like Euclidean distance. For classification, it assigns the class most common among the neighbors; for regression, it averages the values of the neighbors to make a prediction.
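The description above maps almost directly onto code. Below is a minimal brute-force sketch using Euclidean distance (`math.dist`, available from Python 3.8); the function names and the toy points are hypothetical, and ties in the vote are resolved in favor of the label encountered first among the nearest neighbors.

```python
from collections import Counter
from math import dist


def k_nearest(train_points, train_targets, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    return [train_targets[i] for i in order[:k]]


def knn_classify(train_points, train_labels, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(train_points, train_labels, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k=3):
    """Average of the target values of the k nearest neighbors."""
    neighbors = k_nearest(train_points, train_values, query, k)
    return sum(neighbors) / len(neighbors)


# Made-up 2D points forming two clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]

print(knn_classify(points, labels, (2, 2), k=3))       # a
print(knn_classify(points, labels, (6, 5), k=3))       # b
print(knn_regress(points, values, (6.5, 6.5), k=3))    # 5.0
```

Because every prediction scans the whole training set, this brute-force version is only practical for small datasets; features should also be on comparable scales, since the distance metric treats all dimensions equally.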