25+ [MUST-KNOW] Data Science Interview Questions & Answers


Last updated on 08 Jun 2020

About author

Gopinath (Sr Data Science Manager )

He is a proficient technical expert in his industry domain, serving 8+ years. He is dedicated to imparting informative knowledge to freshers and shares these blogs with them.


Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It is related to data mining, deep learning and big data. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.

1. What are the types of machine learning?

Ans:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement Learning

2. What is Supervised learning in machine learning?

Ans:

Supervised learning: When you know the target variable for the problem statement, it becomes supervised learning. It can be applied to perform regression and classification.

Example: Linear Regression and Logistic Regression.
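A minimal supervised-learning sketch with scikit-learn's LinearRegression; the toy feature/target data below is invented for illustration:

from sklearn.linear_model import LinearRegression

# Known target values (y) are what make this supervised
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X, y)
print(model.predict([[5]]))  # [10.] for this perfectly linear toy data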

3. What is Unsupervised learning in machine learning?

Ans:

Unsupervised learning: When you do not know the target variable for the problem statement, it becomes unsupervised learning. It is widely used to perform clustering.

Example: K-Means and Hierarchical clustering.
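A minimal clustering sketch with scikit-learn's KMeans; the 2-D points below are invented for illustration and carry no labels:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # two discovered groups, e.g. [1 1 0 0]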

4. What are the commonly used python packages?

Ans:

  • NumPy
  • Pandas
  • scikit-learn
  • Matplotlib

5. What are the commonly used R packages?

Ans:

  • caret
  • data.table
  • reshape
  • reshape2
  • e1071
  • DMwR
  • dplyr
  • lubridate

6. Name the commonly used algorithms.

Ans:

  • Linear regression
  • Logistic regression
  • Random Forest
  • KNN

7. What is precision?

Ans:

The ratio of true positives to all predicted positives, i.e., TP / (TP + FP). It is one of the most commonly used error metrics for classification. The range is from 0 to 1, where 1 represents 100%.

8. What is recall?

Ans:

The ratio of true positives to all actual positives, i.e., TP / (TP + FN). The range is again from 0 to 1.

9. Which metric acts like accuracy in classification problem statements?

Ans:

  • F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
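A quick sketch of computing all three metrics with scikit-learn; the toy label vectors are invented for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)    # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)       # TP / (TP + FN) = 3/4
print(p, r, f1_score(y_true, y_pred))  # F1 = 2pr/(p+r) = 0.75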

10. What is a normal distribution?

Ans:

When the data is symmetrically distributed around the centre, such that the mean, median and mode are equal, following a bell-shaped curve.

11. What is overfitting?

Ans:

Any model that shows high inconsistency between the training error and the test error leads to a serious business problem. If the error rate on the training set is low and the error rate on the test set is high, then we can conclude it is an overfitting model.

12. What is underfitting?

Ans:

Any model that provides poor predictions on both the training and the test data leads to a serious business problem. If the error rate on the training set is high and the error rate on the test set is also high, then we can conclude it is an underfitting model.

13. What is a univariate analysis?

Ans:

An analysis that is applied to one attribute at a time is called a univariate analysis. Boxplot is one of the widely used univariate plots. Scatter plots and Cook's distance are methods used for bivariate and multivariate analysis.

14. Name a few methods for Missing Value Treatments.

Ans:

  • Central Imputation: This method acts on central tendencies. Missing values are filled with the mean or median for numerical attributes and the mode for categorical attributes.
  • KNN (K Nearest Neighbour) imputation: The distance between two or more attributes is calculated using Euclidean distance, and the nearest neighbours are used to treat the missing values. Mean and mode are again used, as in central imputation.
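A minimal central-imputation sketch in pandas; the DataFrame and its missing values are invented for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, 40],
                   'city': ['NY', 'LA', None, 'NY']})

# Mean for the numerical column, mode for the categorical column
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])
print(df)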

15. What is the Pearson correlation?

Ans:

Correlation between predicted and actual data can be examined and understood using this method:

  • The range is from -1 to +1.
  • -1 refers to a 100% negative correlation, whereas +1 refers to a 100% positive correlation.
  • The formula is r = cov(x, y) / (sd(x) * sd(y)); equivalently, r = m * sd(x) / sd(y), where m is the slope of the regression line.
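A short sketch of computing Pearson's r with NumPy; the paired samples are invented for illustration:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(r)  # about 0.85: a strong positive linear relationship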

16. How and by what methods data visualizations can be effectively used?

Ans:

In addition to giving insights in a very effective and efficient manner, data visualization need not be restricted to bar charts, line charts or other stereotypical graphs. Data can be represented in a much more visually pleasing manner.

One thing that has to be taken care of is to convey the intended insight or finding correctly to the audience. Once that baseline is set, the innovative and creative part can help you come up with better-looking and more functional dashboards. There is a fine line between a simple, insightful dashboard and an awesome-looking dashboard with zero fruitful insights.

17. How to understand the problems faced during data analysis?

Ans:

Most of the problems faced during hands-on analysis or data science arise from a poor understanding of the problem at hand, with too much focus on tools, end results and other aspects of the project.

Breaking the problem down to a granular level and understanding it takes a lot of time and practice to master. Going back to square one in data science projects can be seen in a lot of companies, and even in your own projects or Kaggle problems.

18. Advantages of Tableau Prep?

Ans:

Tableau Prep saves a lot of time, just as its parent software (Tableau) does when creating impressive visualizations. The tool has a lot of potential for taking professionals from data cleaning and merging steps to creating final usable data that can be linked to Tableau Desktop for visualizations and business insights. A lot of manual tasks are reduced, and the time saved can be used to make better findings and insights.

19. What is the common perception about visualization?

Ans:

People think visualization is just charts and summary information. But visualizations go beyond that and drive business with a lot of underlying principles. Learning design principles can help anyone build effective and efficient visualizations, and the Tableau Prep tool can drastically increase the time we can spend on the more important parts. The only issue with Tableau is that it is paid software, and companies need to pay to leverage this awesome tool.

20. What are the time series algorithms?

Ans:

Time series algorithms like ARIMA, ARIMAX, SARIMA and Holt-Winters are very interesting to learn and to use for solving a lot of complex business problems. Data preparation for time series analysis plays a vital role: stationarity, seasonality, cycles and noise need time and attention. Take as much time as you need to get the data right; then you can run any model on top of it.


21. How to choose the right chart when creating a visualization?

Ans:

Using the right chart to represent data is one of the key aspects of data visualization and design principles. You will always have options to choose from when deciding on a chart, but settling on the right chart comes only with experience, practice and a deep understanding of end-user needs. That dictates everything in the dashboard.

22. Where to seek help in case of discrepancies in Tableau?

Ans:

When you face any issue in Tableau, try searching the Tableau community forum. It is one of the best places to get your queries answered; you can write your question and get it answered within an hour or a day. You can also post on LinkedIn and follow people.

23. Companies are now heavily investing their money and time in dashboards. Why?

Ans:

To make stakeholders more aware of the business through data. Working on visualization projects helps you develop one of the key skills every data scientist should possess: thinking from the shoes of the end user.

If you're learning any visualization tool, download a dataset from Kaggle. Building charts and graphs for the dashboard should be the last step. Research more about the domain and think about the KPIs you would like to see in the dashboard if you were going to be the end user. Then start building the dashboard piece by piece.

24. How can I achieve accuracy in the first model that I build?

Ans:

Building machine learning models involves a lot of interesting steps. 90%-accuracy models don't come at the very first attempt. You have to apply a lot of better feature selection techniques to get to that point, which means it involves a lot of trial and error. The process will help you learn new concepts in statistics, math and probability.

25. What is the basic responsibility of a Data Scientist?

Ans:

As a data scientist, we have the responsibility to make complex things simple enough that anyone without context can understand what we are trying to convey.

The moment we start explaining even the simple things, the mission of making the complex simple goes away. This happens a lot when we are doing data visualization. Less is more: rather than pushing too much information at the reader's brain, we need to figure out how easily we can help them consume a dashboard or a chart. This is simple to say but difficult to implement. You must bring the complex business value out of a self-explanatory chart. It's a skill every data scientist should strive towards and is good to have in their arsenal.

26. How do I become a SAS analyst?

Ans:

• Step 1: Earn a college degree. Businesses prefer SAS programmers who have completed a bachelor's degree program in statistics or computer science.
• Step 2: Acquire SAS certification.
• Step 3: Consider getting an advanced degree.
• Step 4: Gain SAS programming work experience.

27. What makes SAS stand out over other data analytics tools?

Ans:

• Ease of learning: The features included in SAS are remarkably easy to learn, and it is the most suitable option for those who already know SQL. R, on the other hand, comes with a steep learning curve and a low-level programming style.
• Data handling capacity: It is on par with the leading tools, including R and Python, and when it comes to handling huge data, it is the best platform to engage.
• Graphical capabilities: It comes with functional graphical capabilities, and only limited knowledge is needed to customize plots.
• Better tool management: Updates are released in a controlled environment, which is why they are well tested. R and Python, by contrast, have open contributions, and the risk of errors in the latest development is also high.

28. What is RUN-Group processing?

Ans:

To use RUN-group processing, you start the procedure and then submit many RUN-groups.

A RUN-group is a group of statements that contains at least one action statement and ends with a RUN statement. It can contain other SAS statements such as AXIS, BY, GOPTIONS, LEGEND, NOTE, or WHERE.

29. Define BY-Group processing.

Ans:

BY-group processing is a method of processing observations from one or more SAS data sets that are grouped or ordered by the values of one or more shared variables. All data sets that are being combined must include one or more BY variables.

30. What is the right way to validate a SAS program?

Ans:

Write OPTIONS OBS=0 at the beginning of the code. The program then compiles without processing any observations, and any issues can be recognized in the log by the colours that get highlighted.

31. Do you know any SAS functions and CALL routines?

Ans:

SAS functions and CALL routines perform computations on their arguments and return or alter values. An argument can be a variable name, a constant, or any SAS expression, including another function; some arguments that SAS allows are reserved for special purposes. Multiple arguments are separated with a comma.

32. What is meant by Precision and Recall?

Ans:

• Recall: It is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives in the data.
• Precision: It is also known as the positive predictive value. It is prediction-based: it indicates the number of accurate positives the model claims compared to the total number of positives it claims.

33. What is deep learning?

Ans:

Deep learning is a subset of machine learning that uses multi-layered neural networks to learn from data.

34. What is the F1 score?

Ans:

The F1 score is defined as a measure of a model's performance.

35. How is the F1 score used?

Ans:

The F1 score is the harmonic mean of the Precision and Recall of a model. An F1 score of 1 is the best and 0 is the worst.


36. What is root cause analysis?

Ans:

"All of us dread that meeting where the boss asks 'why is revenue down?' The only thing worse than that question is not having any answers! There are many changes happening in your business every day, and often you will want to understand exactly what is driving a given change, especially if it is unexpected. Understanding the underlying causes of change is known as root cause analysis."

37. What are confounding variables?

Ans:

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. A study fails when it does not account for the confounding factor.

38. How can you randomize the items of a list in place in Python?

Ans:

Consider the example shown below:

• from random import shuffle
• x = ['Data', 'Class', 'Blue', 'Flag', 'Red', 'Slow']
• shuffle(x)
• print(x)

The output of the code is shown below (the exact ordering varies, since the shuffle is random):

• ['Red', 'Data', 'Blue', 'Slow', 'Class', 'Flag']

39. How to get the indices of the N maximum values in a NumPy array?

Ans:

We can get the indices of the N maximum values in a NumPy array using the code below:

• import numpy as np
• arr = np.array([1, 3, 2, 4, 5])
• print(arr.argsort()[-3:][::-1])

Output:

[4 3 1]

40. How do you make 3D plots/visualizations using NumPy/SciPy?

Ans:

Like 2D plotting, 3D graphics is beyond the scope of NumPy and SciPy, but, just as in the 2D case, packages exist that integrate with NumPy. Matplotlib provides basic 3D plotting in the mplot3d subpackage, whereas Mayavi provides a wide range of high-quality 3D visualization features, utilizing the powerful VTK engine.

41. What are the types of biases that can occur during sampling?

Ans:

Some simple types of selection bias are described below. Undercoverage occurs when some members of the population are inadequately represented in the sample; a classic example is a survey that relies on convenience sampling, drawn from telephone directories and car registration lists.

• Selection bias
• Undercoverage bias
• Survivorship bias

42. Which Python library is used for data visualization?

Ans:

Plotly, also called Plot.ly because of its main online platform, is an interactive online visualization tool used for data analytics, scientific graphs, and other visualizations. It provides some great APIs, including one for Python.

43. Write code to sort a DataFrame in Python in descending order.

Ans:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Sorts by the values along either axis; passing ascending=False gives descending order.

Parameters:

• by: str or list of str. Name or list of names to sort by. If axis is 0 or 'index', then by may contain index levels and/or column labels; if axis is 1 or 'columns', then by may contain column levels and/or index labels. (Changed in version 0.23.0: allows specifying index or column level names.)
• axis: {0 or 'index', 1 or 'columns'}, default 0. Axis to be sorted.
• ascending: bool or list of bool, default True. Sort ascending vs. descending. Specify a list for multiple sort orders; if this is a list of bools, it must match the length of by.
• inplace: bool, default False. If True, perform the operation in place.
• kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'. Choice of sorting algorithm; see also np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
• na_position: {'first', 'last'}, default 'last'. 'first' puts NaNs at the beginning, 'last' puts NaNs at the end.

Returns: sorted_obj (DataFrame)
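A short usage sketch; the DataFrame and its column name are invented for illustration:

import pandas as pd

df = pd.DataFrame({'score': [3, 1, 2]})
# ascending=False gives descending order
print(df.sort_values(by='score', ascending=False))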

44. Why should you use NumPy arrays instead of nested Python lists?

Ans:

Let's say you have a list of numbers, and you want to add 1 to every element of the list.

In regular Python, you would do:

• a = [6, 2, 1, 4, 3]
• b = [e + 1 for e in a]

Whereas with NumPy, you simply have to do:

• import numpy as np
• a = np.array([6, 2, 1, 4, 3])
• b = a + 1

It also works for every NumPy mathematical function: you can take the exponential of every element of an array using np.exp, for example.

45. Why is an import statement required in Python?

Ans:

To be able to use any functionality, the respective code logic needs to be accessible to the Python interpreter. With the help of the import statement, we can use specific scripts. However, there are thousands of such scripts available, and not every available script can be loaded at once. Hence we use the import statement to load only the scripts we want to use:

• import pandas as pd
• import numpy as np

46. What is an alias in the import statement? Why is it used?

Ans:

Aliases are used in import statements for ease of use. The imported module may have a long name, for example import multiprocessing. Every time we want to access any script present in the multiprocessing module, we need to use the word multiprocessing.

However, if an alias is used, import multiprocessing as mp, we can simply replace the word multiprocessing with mp.

47. Are the aliases used for a module fixed/static?

Ans:

No, the aliases are not fixed. An alias can be named as per your convenience. However, the documentation of a module sometimes specifies the alias to be used for ease of understanding.

48. How to access a specific script inside a module?

Ans:

A specific script or object can be imported with from <module> import <name>, for example from pandas import DataFrame. If the whole module needs to be imported, we can simply use import pandas.

49. What is a nonparametric test used for?

Ans:

Non-parametric tests do not assume that the data follow a specific distribution. They can be used whenever the data do not meet the assumptions of parametric tests.

50. What are the pros and cons of the Decision Trees algorithm?

Ans:

• Pros: Easy to interpret. Ignores irrelevant independent variables, since their information gain is minimal. Can handle missing data. Fast modelling.
• Cons: Many combinations are possible when creating a tree, so there is a chance it will not find the best possible tree.
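A minimal sketch with scikit-learn's DecisionTreeClassifier; the toy dataset is invented for illustration:

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # two features
y = [0, 1, 1, 0]                      # class equals the first feature

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1, 1]]))  # [1]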


51. Name some Classification Algorithms.

Ans:

Linear classifiers (Logistic Regression, Naive Bayes Classifier), Decision Trees, Random Forest, Neural Networks, K-Nearest Neighbours.

52. What are the pros and cons of the Naive Bayes algorithm?

Ans:

• Pros: Big-sized data is handled easily; multiclass performance is good and accurate; it is not process-intensive.
• Cons: Assumes independence of the predictor variables.

53. What are the types of Skewness?

Ans:

A dataset can be skewed right or skewed left; those are the two types.

54. What is skewed data?

Ans:

A data distribution whose values trail off towards the right or the left.

55. What is the skewness of this data? 27 ; 28 ; 30 ; 32 ; 34 ; 38 ; 41 ; 42 ; 43 ; 44 ; 46 ; 53 ; 56 ; 62

Ans:

The data set is skewed left: the mean (about 41.1) is less than the median (41.5), which indicates left skew.

56. What is an outlier?

Ans:

An outlier is a value that lies very far away from the rest of the values in the data set.

57. Mention the characteristics of a symmetric data distribution.

Ans:

The mean is equal to the median, and the tails of the distribution are balanced.

58. What are the applications of data science?

Ans:

• Optical character recognition
• Recommendation engines
• Filtering algorithms
• Personal assistants
• Advertising
• Surveillance
• Autonomous driving
• Facial recognition, and more

59. Define EDA.

Ans:

EDA (exploratory data analysis) is an approach to analysing data to summarise their main characteristics, often with visual methods.

60. What are the steps in exploratory data analysis?

Ans:

• Make a summary of observations
• Describe the central tendencies or core part of the dataset
• Describe the shape of the data
• Identify potential associations
• Develop insight into errors, missing values and major deviations

61. What are the types of data available in Enterprises?

Ans:

• Structured data
• Unstructured data
• Big data from social media, surveys, pictures, audio, video, drawings and maps
• Machine-generated data from instruments
• Real-time data feeds

62. What are the various types of analysis based on the number of variables?

Ans:

• Univariate: 1 variable
• Bivariate: 2 variables
• Multivariate: more than 2 variables

63. What is the difference between primary data and secondary data?

Ans:

Data collected by the interested party itself is primary data; this data is collected afresh, for the first time. Data that someone else has collected and that is being used by you is secondary data.

64. What is the difference between qualitative and quantitative methods?

Ans:

• The quantitative method analyses data based on numbers.
• The qualitative method analyses data by attributes.

65. What is a histogram?

Ans:

A histogram is an accurate representation of the distribution of numerical data based on occurrences/frequencies.

66. What are the common measures of central tendency?

Ans:

• Mean
• Median
• Mode

67. What are quartiles?

Ans:

Quartiles are three points in the data that divide it into four groups, each consisting of a quarter of the data.

68. What are the commonly used error metrics in regression tasks?

Ans:

• MSE: Mean squared error, the average of the squared errors
• RMSE: Root mean squared error, the square root of the MSE
• MAPE: Mean absolute percentage error
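A quick sketch of these metrics with NumPy and scikit-learn (mean_absolute_percentage_error is available in recent scikit-learn versions); the actual/predicted values are invented for illustration:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # average of squared errors
rmse = np.sqrt(mse)                       # root of MSE
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mse, rmse, mape)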

69. What are the commonly used error metrics for classification tasks?

Ans:

• F1 score
• Accuracy
• Sensitivity
• Specificity
• Recall
• Precision

70. What is it called when there is more than 1 explanatory variable in a regression task?

Ans:

Multiple linear regression


71. What are residuals in a regression task?

Ans:

The difference between the predicted value and the actual value is called the residual.

72. What are exploding gradients?

Ans:

The gradient is the direction and magnitude calculated during the training of a neural network; it is used to update the network weights in the right direction and by the right amount.

"Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training." At an extreme, the values of the weights can become so large as to overflow and result in NaN values. This makes the model unstable and unable to learn from the training data.

73. What are the main types of supervised learning tasks?

Ans:

• Classification task [target is categorical in nature]
• Regression task [target is continuous in nature]

74. Can Random Forest be used for classification and regression?

Ans:

Yes, it can be used for both.

75. What is the R square value?

Ans:

The R-squared value tells us how closely the regression line fits the actual values.

76. What are some common ways of imputation?

Ans:

• Mean imputation
• Median imputation
• KNN imputation
• Stochastic regression
• Substitution

77. What is the difference between a Series and a list?

Ans:

• A list is both size-mutable and data-mutable.
• A pandas Series is data-mutable but not size-mutable.

78. Which function is used to get descriptive statistics of a DataFrame?

Ans:

• describe()

79. What parameter is used to update the data without explicitly assigning it to a variable?

Ans:

The inplace parameter is used to apply the result of a function to the object itself. If inplace=True, there is no need to explicitly assign the result to a variable.

80. What is the difference between a dictionary and a set?

Ans:

• A dictionary has key-value pairs.
• A set does not have key-value pairs.
• A set has only unique elements.

81. How to create a Series with letters as the index?

Ans:

• pd.Series({'a': 1, 'b': 2})

will create a and b as indexes, with 1 and 2 as their respective values.

82. Which function can be used to filter a DataFrame?

Ans:

The query function can be used to filter a DataFrame.

83. What is the function to create a train/test split?

Ans:

from sklearn.model_selection import train_test_split. This function is used to create a train/test split of the data.
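A short usage sketch; the toy data is invented for illustration:

from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(len(X_train), len(X_test))  # 4 2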

84. What is pickling?

Ans:

Pickling is the process of serializing a Python object and saving it to the hard disk or another physical storage drive.

85. What is unpickling?

Ans:

Unpickling is used to read a pickled object back from the hard disk or physical storage drive.
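A minimal sketch of both operations with the standard pickle module; the object and filename are invented for illustration:

import pickle

data = {'name': 'model', 'score': 0.9}

# Pickling: serialize the object to disk
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Unpickling: read the object back
with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored)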

86. What are the most common web frameworks of Python?

Ans:

Django and Flask.

87. How to convert a number of Series to a DataFrame?

Ans:

• pd.DataFrame(data={'col1': series1, 'col2': series2})

88. How to select a section of a DataFrame?

Ans:

Using the iloc and loc indexers, rows and columns can be selected.
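A short sketch of both indexers; the DataFrame is invented for illustration:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(df.loc[0, 'a'])   # label-based: row 0, column 'a' -> 1
print(df.iloc[1:, 1])   # position-based: rows 1 onward, second column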

89. How are exceptions handled in Python?

Ans:

Exceptions can be handled using try/except statements.
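A minimal sketch of the pattern:

try:
    result = 10 / 0
except ZeroDivisionError as err:
    # Handle the specific exception instead of crashing
    print('caught:', err)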

90. Is multiprocessing possible in Python?

Ans:

Yes, it is possible using the multiprocessing module.

91. Can the values be replaced in tuples?

Ans:

No, values cannot be replaced in a tuple, as tuples are immutable.

92. What are lambda functions in Python and how are they different from def (defining functions) in Python?

Ans:

A lambda function in Python is used for evaluating an expression and then returning a value, whereas def needs a function name and lets the program logic be broken into smaller chunks. A lambda is an inline function consisting of only a single expression; it can take any number of arguments.
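A short sketch contrasting the two; both functions below behave identically:

add_lambda = lambda x, y: x + y  # inline, single expression

def add_def(x, y):
    # named function; the body can hold multiple statements
    return x + y

print(add_lambda(2, 3), add_def(2, 3))  # 5 5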

93. Difference between supervised and unsupervised machine learning?

Ans:

Supervised learning is a method that needs labelled training data. Unsupervised learning does not need labelled data.

94. How to differentiate between KNN and K-means clustering?

Ans:

KNN stands for K-Nearest Neighbours; it performs classification, and so it is a supervised algorithm. K-means is an unsupervised clustering algorithm.

95. What is your opinion on our current data process?

Ans:

This type of question is asked to check whether you listen carefully to the interviewer's use case and can respond in a constructive and insightful manner. Based on your response, the interviewer can judge whether you would give vague replies to their team or not.

96. Explain the goal of A/B Testing.

Ans:

• A/B Testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of some interest by identifying any changes to a webpage.
• A highly reliable method for finding out the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.

97. How can the correlation between continuous and categorical variables be captured?

Ans:

It is possible to do that using the ANCOVA technique, which stands for Analysis of Covariance. It is used to measure the association between continuous and categorical variables.

98. Difference between an Array and a Linked list?

Ans:

An array is an ordered collection of objects stored in contiguous memory with index-based access. A linked list is a group of nodes arranged in sequential order, where each node points to the next.

99. Difference between "long" and "wide" format data?

Ans:

In the wide format, each subject's repeated responses are in a single row, with each response in a separate column. In the long format, each row is one observation per subject per time point. You can recognize data in the wide format by the fact that columns usually represent groups.

100. What do you know about the term Normal Distribution?

Ans:

• Data can be distributed in many ways, with a bias to the left or to the right, or it can be all jumbled up.
• However, there are cases where data is distributed around a central value without any bias to the left or right, arranging itself in a natural order in the form of a bell-shaped curve.
