1. What is data science, and how is it different from data analytics?
Ans:
Data science is a broad discipline that combines statistics, machine learning and programming to extract insights and build predictive models from data. In comparison, data analytics focuses mainly on examining existing data, identifying trends and solving specific problems using descriptive statistics. While both fields deal with data, data science leans more toward prediction and advanced analysis, whereas data analytics is primarily interpretive.
2. What role does a data scientist perform in a company?
Ans:
A data scientist is essential to solving business problems by working with large and complex datasets. Their responsibilities include cleaning and preparing data, building predictive models and analyzing outcomes to provide valuable insights. They also use visualization tools to present their findings in a clear, actionable way, helping stakeholders make better decisions.
3. Describe how structured and unstructured data are different.
Ans:
Structured data is arranged in a defined format, often stored in databases with rows and columns such as SQL tables, making it easy to analyze. On the other hand, unstructured data lacks a fixed format and includes information like images, videos, social media posts, emails or PDFs. While structured data is straightforward to handle, unstructured data requires advanced processing techniques for analysis.
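For illustration, here is a minimal Python sketch of the contrast; the order table and the review string are invented examples, not part of any real dataset.

```python
# Structured vs. unstructured data: an illustrative sketch with made-up records.
import pandas as pd

# Structured: fixed schema (rows and columns), ready for direct querying/aggregation
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount":   [250.0, 99.5, 430.0],
    "country":  ["US", "DE", "IN"],
})
print(orders.groupby("country")["amount"].sum())

# Unstructured: free-form text with no fixed schema; it must be parsed or
# transformed (e.g. tokenized) before any quantitative analysis is possible
review = "Great product!! Arrived late though... support was helpful :)"
tokens = review.lower().split()
print(len(tokens), "tokens")
```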
4. What are a data science project's key steps?
Ans:
A data science project usually starts with defining the problem, collecting relevant data and preparing it for analysis. The next stage is exploratory data analysis (EDA) to identify relationships and patterns. After that, models are built, trained and evaluated for accuracy. Finally, the best model is deployed into production and monitored regularly to ensure accuracy and continuous improvement.
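A condensed sketch of these stages in Python might look as follows; it uses scikit-learn's bundled breast cancer dataset as a stand-in for real, problem-specific data, and the model choice is purely illustrative.

```python
# Load -> split -> train -> evaluate: a compressed version of the project workflow.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collect / load data (here: a bundled toy dataset)
X, y = load_breast_cancer(return_X_y=True)

# 2. Prepare data: hold out a test set for honest evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Build and train a model (preprocessing + estimator in one pipeline)
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)

# 4. Evaluate before deciding whether to deploy and monitor
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```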
5. How is missing data in a dataset handled?
Ans:
Handling missing data is critical for accurate analysis. Basic methods include removing rows or columns with too many missing values or replacing them with statistical measures like mean, median or mode. For more reliable results, advanced techniques such as KNN imputation or predictive modeling can be used to estimate the missing values more effectively.
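A minimal sketch of the basic and advanced options in Python is shown below; the small table is invented purely for illustration, and the parameters (such as n_neighbors=2) are arbitrary.

```python
# Basic vs. advanced handling of missing values.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 81_000, 58_000],
})

# Basic: drop incomplete rows, or fill gaps with a statistic such as the mean
dropped = df.dropna()
mean_filled = df.fillna(df.mean(numeric_only=True))

# Advanced: KNN imputation estimates each missing value from similar rows
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)
print(mean_filled, knn_filled, sep="\n\n")
```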
6. How does supervised learning differ from unsupervised learning?
Ans:
Supervised learning works with labeled data, meaning the algorithm is trained using predefined inputs and outputs to predict results or classify data, such as in regression or classification tasks. In contrast, unsupervised learning uses unlabeled data, where the objective is to uncover hidden patterns, clusters or relationships without predefined outcomes, such as in clustering or dimensionality reduction.
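A minimal side-by-side sketch of the two approaches, using scikit-learn's iris dataset purely as an example, could look like this:

```python
# Supervised classification (labels available) vs. unsupervised clustering (no labels).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from inputs X to known labels y
clf = LogisticRegression(max_iter=200).fit(X, y)
print("Predicted class of first sample:", clf.predict(X[:1]))

# Unsupervised: group the same inputs without using y at all
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assigned to first sample:", km.labels_[0])
```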
7. Describe how cross-validation is used in model evaluation.
Ans:
Cross-validation is a method used to test how effectively a machine learning model works on unseen data. K-fold cross-validation splits the dataset into several parts (folds), with the model trained on some folds and tested on the remaining one. This process is repeated multiple times, reducing overfitting and providing a more reliable estimate of the model’s overall performance.
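A minimal sketch of 5-fold cross-validation with scikit-learn (the dataset and classifier below are just illustrative choices):

```python
# K-fold cross-validation: each fold serves once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With cv=5, the model is trained and tested 5 times on different splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))
```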
8. What is a confusion matrix? Explain its components.
Ans:
A confusion matrix is a table that compares a classification model’s predicted outcomes with the actual outcomes. It has four components: True Positives, True Negatives, False Positives and False Negatives. These values help in calculating performance metrics such as accuracy, precision, recall and F1 score, offering deeper insight into how effectively the model classifies data.
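A minimal sketch of the matrix and the metrics derived from it, using small hand-made label arrays for illustration:

```python
# Confusion matrix and derived metrics for a toy binary classification result.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```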
9. How do you select important features in a dataset?
Ans:
Feature selection is an important process for improving model efficiency by using only the most relevant inputs. It can be done through filter methods, which apply statistical tests to rank features; wrapper methods, such as recursive feature elimination, which test feature combinations; and embedded methods, like Lasso regularization, which automatically select key features during model training.
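A minimal sketch of the three families in scikit-learn; the dataset, the choice of keeping five features, and the Lasso alpha are arbitrary illustrative values.

```python
# Filter, wrapper and embedded feature selection on an example dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features with a statistical test (ANOVA F-score) and keep the top 5
filtered = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination repeatedly drops the weakest feature
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)

# Embedded: Lasso shrinks unhelpful coefficients to exactly zero during training
lasso = Lasso(alpha=0.1).fit(X, y)

print("Filter keeps:  ", filtered.get_support().sum(), "features")
print("Wrapper keeps: ", rfe.get_support().sum(), "features")
print("Lasso keeps:   ", (lasso.coef_ != 0).sum(), "features")
```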
10. Explain the working of the k-nearest neighbors (KNN) algorithm.
Ans:
The KNN algorithm is a simple yet effective method used for both classification and regression tasks. It finds the "k" data points that are closest to a new input based on a distance measure such as Euclidean distance. For classification, the new point is assigned to the class most common among its neighbors, while for regression, the prediction is made by averaging the values of the nearest neighbors.
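A minimal classification sketch with k = 3 on scikit-learn's iris dataset (the value of k and the split ratio are illustrative):

```python
# KNN classification: assign each test point the majority class of its 3 nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Euclidean distance is the default metric in scikit-learn's KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```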