# Get Statistics Interview Questions [ TO GET HIRED ]

Last updated on 09th Nov 2021, Blog, Interview Questions

Statistics has a large impact on today’s world of computing and large-scale data handling. Many companies are investing billions of dollars in statistics and analytics, which creates many jobs in this sector and increases the competition for them. To help you with your Statistics interview, we have compiled these interview questions and answers to guide you on how to approach questions and answer them effectively.

**1. How is the statistical significance of an insight assessed?**

__Ans:__

Hypothesis testing is used to assess the statistical significance of an insight. First, the null hypothesis and the alternative hypothesis are stated, and a significance level (alpha) is chosen.

Next, the p-value is calculated: the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If the p-value turns out to be less than alpha, the null hypothesis is rejected, and the result is considered statistically significant.
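The procedure can be sketched in a few lines. This is a minimal illustration, assuming a two-sided z-test with a known population standard deviation; the function name `two_sided_z_test` and all the numbers are made up for the example and are not from any particular dataset.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, null_mean, sigma, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Probability of a result at least this extreme, assuming H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: H0 says the mean is 50; we observed 52 from 100 values.
z, p_value = two_sided_z_test(sample_mean=52, null_mean=50, sigma=10, n=100)
alpha = 0.05
reject_null = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

With a different test (for example a t-test), only the way the p-value is computed changes; the comparison against alpha stays the same.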

**2. Where are long-tailed distributions used?**

__Ans:__

A long-tailed distribution is a type of distribution where the tail drops off gradually toward the end of the curve.

The Pareto principle and the distribution of product sales are good examples of long-tailed distributions. Long-tailed distributions also appear frequently in classification and regression problems.

**3. What type of data does not have a log-normal distribution or a Gaussian distribution?**

__Ans:__

Data that follows an exponential distribution has neither a log-normal nor a Gaussian distribution. Categorical data of any type will not have these distributions either.

Example: Duration of a phone call, time until the next earthquake, etc.

**4. What is Mean?**

__Ans:__

Mean is the average of a collection of values. We can calculate the mean by dividing the sum of all observations by the number of observations.

**5. What is the meaning of standard deviation?**

__Ans:__

Standard deviation represents the magnitude of how far the data points are from the mean. A low value of standard deviation is an indication of the data being close to the mean, and a high value indicates that the data is spread to extreme ends, far away from the mean.

**6. What is a bell-curve distribution?**

__Ans:__

A normal distribution can be called a bell-curve distribution. It gets its name from the bell curve shape that we get when we visualize the distribution.

**7. What are the types of selection bias in statistics?**

__Ans:__

There are many types of selection bias as shown below:

- Observer selection
- Attrition
- Protopathic bias
- Time intervals
- Sampling bias

**8. What are left-skewed and right-skewed distributions?**

__Ans:__

A left-skewed distribution is one where the left tail is longer than the right tail. Here, it is important to note that typically mean < median < mode. Similarly, a right-skewed distribution is one where the right tail is longer than the left one. Here, typically mean > median > mode.

**9. What is correlation?**

__Ans:__

Correlation is used to measure the relationship between two quantitative variables. Unlike covariance, correlation is standardized, so it tells us how strong the relationship is between two variables. The value of correlation between two variables ranges from -1 to +1.

A value of -1 represents a perfect negative linear correlation, i.e., as the value of one variable increases, the value of the other decreases. Similarly, +1 means a perfect positive correlation, where an increase in one variable goes with an increase in the other, whereas 0 means there is no linear correlation. If two predictor variables are strongly correlated with each other, they may harm the statistical model, and dropping one of them is a common remedy.
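To make the -1 to +1 range concrete, here is a small sketch that computes the Pearson correlation coefficient from scratch. The helper name `pearson_r` and the toy data are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # moves with `hours`
falling = [8, 6, 4, 2]   # moves against `hours`
print(pearson_r(hours, rising), pearson_r(hours, falling))
```

Because the result is divided by both spreads, rescaling either variable (say, hours to minutes) leaves the coefficient unchanged, which is what distinguishes it from covariance.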

**10. What is the meaning of six sigma in statistics?**

__Ans:__

Six sigma is a quality assurance methodology used widely in statistics to provide ways to improve processes and functionality when working with data.

A process is considered six sigma when 99.99966% of its outcomes are defect-free.

**11. What does the Poisson distribution represent?**

__Ans:__

**12. What is DOE?**

__Ans:__

DOE is an acronym for the Design of Experiments in statistics. It is the design of a task that aims to describe how information varies in response to changes in the independent input variables.

**13. How is missing data handled in statistics?**

__Ans:__

There are many ways to handle missing data in Statistics:

- Prediction of the missing values
- Assignment of individual (unique) values
- Deletion of rows, which have the missing data
- Mean imputation or median imputation
- Using random forests, which support the missing values

**14. What is the Pareto principle?**

__Ans:__

The Pareto principle is also called the 80/20 rule, which means that 80 percent of the results are obtained from 20 percent of the causes in an experiment.

A simple example of the Pareto principle is the observation that 80 percent of peas come from 20 percent of pea plants on a farm.

**15. What is exploratory data analysis?**

__Ans:__

Exploratory data analysis is the process of performing investigations on data to understand the data better.

In this, initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also check if the assumptions are right.

**16. What is the meaning of selection bias?**

__Ans:__

- Selection bias is a phenomenon that involves the selection of individual or grouped data in a way that is not considered to be random.
- Randomization plays a key role in performing analysis and understanding model functionality better.
- If correct randomization is not achieved, then the resulting sample will not accurately represent the population.

**17. What is the probability of rolling a sum of 5, and of rolling a sum of 8, with two fair dice?**

__Ans:__

- There are 4 ways of rolling a 5 (1+4, 4+1, 2+3, 3+2):
- P(Getting a 5) = 4/36 = 1/9
- Now, there are 5 ways of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4), since a die face cannot exceed 6
- P(Getting an 8) = 5/36 ≈ 0.139
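Counts like these are easy to get wrong by hand, so it can help to enumerate all 36 ordered outcomes. The sketch below does that; the function name `sum_probability` is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product


def sum_probability(target):
    """Probability that two fair six-sided dice sum to `target`."""
    outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs
    favourable = [pair for pair in outcomes if sum(pair) == target]
    return Fraction(len(favourable), len(outcomes))


print(sum_probability(5), sum_probability(8))
```

Using `Fraction` keeps the answers exact, so they can be compared directly with the hand-derived fractions.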

**18. What is the meaning of an inlier?**

__Ans:__

An inlier is a data point that lies within the range of the rest of the dataset but is nonetheless erroneous or unusual. Finding an inlier is more difficult than finding an outlier, as it typically requires external data. Inliers, like outliers, reduce model accuracy, so they are also removed when they are found in the data.

**19. What are the types of sampling in Statistics?**

__Ans:__

There are four main types of data sampling as shown below:

- **Simple random:** Pure random division
- **Cluster:** Population divided into clusters
- **Stratified:** Data divided into unique groups
- **Systematic:** Picks up every ‘n’th member in the data

**20. What is the difference between one-tail and two-tail hypothesis tests?**

__Ans:__

- 2-tail test: Critical region is on both sides of the distribution
- H0: x = µ
- H1: x ≠ µ

- 1-tail test: Critical region is on one side of the distribution
- H0: x ≤ µ
- H1: x > µ

**21. What do you understand by inferential statistics?**

__Ans:__

Inferential statistics is the practice of drawing conclusions about a population by conducting experiments on a sample taken from that population.

**22. What is P-value and explain it?**

__Ans:__

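When we execute a hypothesis test in statistics, a p-value helps us determine the significance of our results. Hypothesis tests are used to test the validity of a claim that is made about a population. The null hypothesis is the claim that there is no significant difference or effect, so that any difference observed is due to sampling or experimental error. The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true; a small p-value is evidence against the null hypothesis.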

**23. What is the meaning of the five-number summary in Statistics?**

__Ans:__

The five-number summary is a measure of five entities that cover the entire range of data as shown below:

- Low extreme (Min)
- The first quartile (Q1)
- Median
- Upper quartile (Q3)
- High extreme (Max)

**24. What are population and sample in Inferential Statistics, and how are they different?**

__Ans:__

A population is the complete set of observations (data) of interest. The sample is a small portion of that population. Working with the entire population raises the computational cost because of its large volume, and not all data points in the population may be available.

In short:

- We calculate the statistics using the sample.
- Using these sample statistics, we make conclusions about the population.

**25. What are quantitative data and qualitative data?**

__Ans:__

- Quantitative data is also known as numeric data.
- Qualitative data is also known as categorical data.

**26. What is the difference between Descriptive and Inferential Statistics?**

__Ans:__

**27. List the fields where statistics can be used.**

__Ans:__

Statistics can be used in many research fields. Below is the list of fields in which statistics can be used:

- Science
- Technology
- Business
- Biology
- Computer Science
- Chemistry

In these fields, statistics:

- Aids in decision-making
- Provides comparison
- Explains the action that has taken place
- Predicts the future outcome

**28. What is Bessel’s correction?**

__Ans:__

Bessel’s correction is the use of n − 1 instead of n in the denominator when estimating a population’s variance or standard deviation from a sample. It makes the estimate less biased, thereby providing more accurate results.
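As a small illustration, the sketch below computes a variance with and without the correction and compares the results with Python's standard `statistics` module. The sample values are made up for the example.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values
n = len(sample)
sample_mean = sum(sample) / n
squared_deviations = sum((x - sample_mean) ** 2 for x in sample)

# Dividing by n treats the sample as if it were the whole population.
uncorrected_variance = squared_deviations / n
# Bessel's correction: divide by n - 1 when estimating from a sample.
corrected_variance = squared_deviations / (n - 1)

print(uncorrected_variance, statistics.pvariance(sample))
print(corrected_variance, statistics.variance(sample))
```

`statistics.variance` and `statistics.stdev` apply the correction, while `statistics.pvariance` and `statistics.pstdev` do not.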

**29. What is the relationship between the confidence level and the significance level in statistics?**

__Ans:__

The significance level is the probability of rejecting the null hypothesis when it is actually true. The confidence level is the proportion of confidence intervals, built from repeated samples, that would contain the true population parameter.

Both significance and confidence level are related by the following formula:

Significance level = 1 − Confidence level

**30. What types of variables are used for Pearson’s correlation coefficient?**

__Ans:__

- Variables used for Pearson’s correlation coefficient must be measured on either a ratio scale or an interval scale.
- Note that one variable can be on a ratio scale while the other is on an interval scale.

**31. Where is inferential statistics used?**

__Ans:__

Inferential statistics is used for several purposes, such as research, in which we wish to draw conclusions about a population using some sample data. This is performed in a variety of fields, ranging from government operations to quality control and quality assurance teams in multinational corporations.

**32. What is the relationship between mean and median in a normal distribution?**

__Ans:__

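In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median is therefore a quick first check: if they differ noticeably, the distribution is not normal. Equal values alone do not prove normality, however, since any symmetric distribution has this property.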

**33. What are the scenarios where outliers are kept in the data?**

__Ans:__

There are not many scenarios where outliers are kept in the data, but there are some important situations when they are kept. They are kept in the data for analysis if:

- Results are critical
- Outliers add meaning to the data
- The data is highly skewed

**34. How can you calculate the p-value using MS Excel?**

__Ans:__

Following steps are performed to calculate the p-value easily:

- Find the Data tab above
- Click on Data Analysis
- Select Descriptive Statistics
- Select the corresponding column
- Input the confidence level

**35. Can you give an example to denote the working of the central limit theorem?**

__Ans:__

Let’s consider a population of men whose weights are normally distributed, with a mean of 60 kg and a standard deviation of 10 kg.

We want to find the probability that one randomly selected man weighs more than 65 kg, and the probability that the mean weight of 40 randomly selected men is more than 65 kg.

The solution to this can be as shown below:

- For a single man: Z = (x − µ) / σ = (65 − 60) / 10 = 0.5
- For a normal distribution P(Z > 0.5) ≈ 0.309
- For the mean of 40 men, the standard error is σ / √n = 10 / √40 ≈ 1.58, so Z = (65 − 60) / 1.58 ≈ 3.16
- P(Z > 3.16) ≈ 0.0008
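The two tail probabilities can be checked numerically. This is a sketch using `statistics.NormalDist` from the Python standard library; for the mean of 40 men it uses the standard error σ / √n as the spread, which is the step the central limit theorem justifies. The variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 60.0, 10.0   # population mean and standard deviation (kg)
threshold = 65.0
n = 40                   # number of men in the sample

std_normal = NormalDist()  # mean 0, standard deviation 1

# One man: his weight varies with the population standard deviation.
z_single = (threshold - mu) / sigma
p_single = 1 - std_normal.cdf(z_single)

# Mean of n men: the sample mean varies with the standard error sigma / sqrt(n).
standard_error = sigma / sqrt(n)
z_mean = (threshold - mu) / standard_error
p_mean = 1 - std_normal.cdf(z_mean)

print(f"single man: z = {z_single:.2f}, P = {p_single:.4f}")
print(f"mean of {n}: z = {z_mean:.2f}, P = {p_mean:.4f}")
```

The sample mean is far less likely to exceed 65 kg than a single observation, because averaging shrinks the spread by a factor of √n.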

**36. What is the relationship between standard deviation and standard variance?**

__Ans:__

Standard deviation is the square root of the variance. Standard deviation describes how far the data is spread out from the mean, expressed in the same units as the data. Variance, on the other hand, is the average of the squared deviations from the mean of the entire dataset.

**37. When creating a statistical model, how do we detect overfitting?**

__Ans:__

Overfitting can be detected by cross-validation. In cross-validation, we divide the available data into multiple parts and iterate on the entire dataset. In each iteration, one part is used for testing, and others are used for training. This way, the entire dataset will be used for training and testing purposes. A model that performs much better on the training parts than on the testing parts is overfitting.

**38. What are some of the techniques to reduce underfitting and overfitting during model training?**

__Ans:__

Underfitting refers to a situation where the model has high bias and low variance, while overfitting is the situation where there is high variance and low bias.

Following are some of the techniques to reduce underfitting and overfitting:

**For reducing underfitting:**

- Increase model complexity
- Increase the number of features
- Remove noise from the data
- Increase the number of training epochs

**For reducing overfitting:**

- Increase training data
- Stop early while training
- Lasso regularization
- Use random dropouts

**39. What is the use of Hash tables in statistics?**

__Ans:__

Hash tables are data structures that store key-value pairs. A hash table uses a hashing function to compute, from each key, an index into an array of slots where the associated value is stored.

**40. What is the meaning of TF/IDF vectorization?**

__Ans:__

TF-IDF is an acronym for Term Frequency – Inverse Document Frequency. It is used as a numerical measure to denote the importance of a word in a document that belongs to a set of documents. This set of documents is usually called the collection or the corpus.

The TF-IDF value increases in proportion to the number of times a word is repeated in a document, and is offset by the number of documents in the corpus that contain the word. TF-IDF is vital in the field of Natural Language Processing (NLP) as it is mostly used in the domain of text mining and information retrieval.
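The sketch below computes one common, unsmoothed variant of TF-IDF by hand. Libraries often use smoothed or normalized variants, so their numbers will differ; the function name `tf_idf` and the three-document corpus are assumptions made for illustration.

```python
from math import log


def tf_idf(term, document, corpus):
    """TF-IDF of `term` in `document`, where documents are lists of words."""
    # Term frequency: share of the document's words that are `term`.
    tf = document.count(term) / len(document)
    # Inverse document frequency: rarer across the corpus means a larger value.
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
first = corpus[0]
for word in ("the", "cat", "mat"):
    print(word, round(tf_idf(word, first, corpus), 4))
```

A word such as "the", which appears in every document, scores zero even though it is the most frequent word in the first document, while "mat", which appears in only one document, scores highest.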

**41. What does Design of Experiments mean?**

__Ans:__

Design of experiments, also known as DOE, is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).

**42. How do you assess the statistical significance of an insight?**

__Ans:__

You would perform hypothesis testing to determine statistical significance.

- First, you would state the null hypothesis and alternative hypothesis.
- Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true.
- Last, you would set the level of significance (alpha), and if the p-value is less than the alpha, you would reject the null. In other words, the result is statistically significant.

**43. Is mean imputation of missing data acceptable practice? Why or why not?**

__Ans:__

Mean imputation is generally bad practice because it doesn’t take into account feature correlation. For example, imagine we have a table showing age and fitness score and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should.

Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.

**44. What are the assumptions required for linear regression?**

__Ans:__

**There are four major assumptions:**

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.

2. The errors or residuals of the data are normally distributed and independent from each other.

3. There is minimal multicollinearity between explanatory variables.

4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

**45. What is the difference between population parameters and sample statistics?**

__Ans:__

**Population parameters are:**

- Mean = µ
- Standard deviation = σ

**Sample statistics are:**

- Mean = x (bar)
- Standard deviation = s

**46. What is the Central Limit Theorem?**

__Ans:__

The Central Limit Theorem is a cornerstone of statistics. It states that, for a sufficiently large sample size, the distribution of the sample mean is approximately normal. In other words, the shape of the original population distribution does not affect this result.

**47. What is the Binomial Distribution Formula**

__Ans:__

The binomial distribution formula is:

$$b(x; n, P) = {}^{n}C_{x} \cdot P^{x} \cdot (1 - P)^{n - x}$$

Where:

b = binomial probability

x = total number of “successes” (pass or fail, heads or tails, etc.)

P = probability of success on an individual trial

n = number of trials
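The formula translates directly into code. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function name `binomial_pmf` is made up for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Probability of exactly 2 heads in 4 fair coin tosses.
print(binomial_pmf(2, 4, 0.5))
```

Summing `binomial_pmf(x, n, p)` over every x from 0 to n should give 1, which is a handy sanity check.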

**48. What is the difference between population and sample?**

__Ans:__

Population vs sample

Population | Sample |
---|---|
Advertisements for IT jobs in the Netherlands. | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020. |
Songs from the Eurovision Song Contest. | Winning songs from the Eurovision Song Contest that were performed in English. |
Undergraduate students in the Netherlands. | 300 undergraduate students from three Dutch universities who volunteer for your psychology research study. |
All countries of the world. | Countries with published data available on birth rates and GDP since 2000. |

**49. What is kurtosis?**

__Ans:__

Kurtosis is a measure of how heavy the tails of a distribution are, i.e., the degree to which extreme values are present, compared with a normal distribution. The normal distribution has a kurtosis of 3, and values of skewness and excess kurtosis (kurtosis minus 3) between -2 and +2 are generally considered acceptable for treating data as normal. Data sets with a high level of kurtosis tend to contain outliers, and one may need to add data or remove outliers to address this. Data sets with low kurtosis levels have light tails and lack outliers.

**50. What is a bell-curve distribution?**

__Ans:__

A bell-curve distribution is represented by the shape of a bell and indicates normal distribution. It occurs naturally in many situations, especially while analyzing financial data. The peak of the curve marks the mode, mean, and median of the data, and the curve is perfectly symmetrical around it. The key characteristics of a bell-shaped curve are:

- The empirical rule says that approximately 68% of data lies within one standard deviation of the mean in either of the directions.
- Around 95% of data falls within two standard deviations and
- Around 99.7% of data fall within three standard deviations in either direction.
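These percentages can be reproduced from the standard normal cumulative distribution function. A short sketch using `statistics.NormalDist`; the helper name `coverage` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.2%}")
```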

**51. What Is Significance Level?**

__Ans:__

The probability of rejecting the null hypothesis when it is true is called the significance level α, and very common choices are α = 0.05 and α = 0.01.

**52. What do you understand by right skewness? Give an example.**

__Ans:__

When the data is not normally distributed and has a long, elongated tail on the right side, that is called right skewness.

**For example:**

Income distribution.

**53. What is the difference between Data Science and Statistics?**

__Ans:__

Data Science is a science that is led by data. It includes the interdisciplinary fields of scientific methods, algorithms, and even the process for extracting insights from the data. The data can be either structured or unstructured. There are many similarities between data science and data mining, as both extract useful information from the data. Data science also includes mathematical statistics and computer science and its applications.

It is by the combination of statistics, visualization, applied mathematics, and computer science that data science can convert a vast amount of data into insights and knowledge. Thus, statistics forms the main part of data science; it is a branch of mathematics concerned with the collection, analysis, interpretation, organization, and presentation of data.

**54. What are the various methods of sampling?**

__Ans:__

Sampling can be done in 4 broad methods:

- Randomly, or simple random sampling
- Systematically, or taking every kth member of the population
- Cluster, when the population is considered in groups or clusters
- Stratified, i.e., when the population is divided into exclusive groups or strata and a sample is taken from each group

**55. What is the meaning of sensitivity in statistics?**

__Ans:__

Sensitivity, as the name suggests, measures how well a classifier (logistic, random forest, etc.) detects the events that actually occurred:

The simple formula to calculate sensitivity is:

- Sensitivity = Correctly Predicted True Events / Total number of Actual True Events

**56. What are the types of biases that you can encounter while sampling?**

__Ans:__

There are three types of biases:

- Selection bias
- Survivorship bias
- Under coverage bias

**57. What is the benefit of using box plots?**

__Ans:__

Box plots allow us to provide a graphical representation of the 5-number summary and can also be used to compare distributions across groups.

**58. List all the other models that work with statistics to analyze the data?**

__Ans:__

Statistics, along with Data Analytics, analyzes the data and help a business to make good decisions. Predictive ‘Analytics’ and ‘Statistics’ are useful to analyze current data and historical data to make predictions about future events.

**59. How to calculate range and interquartile range?**

__Ans:__

- Range = Maximum value – Minimum value
- IQR = Q3 – Q1

Where, Q3 is the third quartile (75th percentile)

Where, Q1 is the first quartile (25th percentile)
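Both measures are short to compute. The sketch below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`; quartile conventions differ between tools, so other software may give slightly different Q1 and Q3 on the same data. The data values are made up for the example.

```python
import statistics

data = [7, 1, 3, 9, 5, 2, 8, 4, 6]  # hypothetical observations

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```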

**60. What can I do with an outlier?**

__Ans:__

**Remove outlier**

- When we know the data-point is wrong (negative age of a person)
- When we have lots of data
- We should provide two analyses. One with outliers and another without outliers.

**Keep outlier**

- When there are a lot of outliers (skewed data)
- When results are critical
- When outliers have meaning (fraud data)

**61. How to find the mean length of all fishes in the sea?**

__Ans:__

- Define the confidence level (most common is 95%)
- Take a sample of fishes from the sea (to get better results the number of fishes > 30)
- Calculate the mean length and standard deviation of the lengths
- Calculate t-statistics
- Get the confidence interval in which the mean length of all the fishes should be.
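The steps above can be sketched as follows. One substitution to note: the Python standard library has no t-distribution, so this sketch takes the critical value from the standard normal distribution instead of the t-statistic named above. For samples larger than 30 the two are close, but a t-based interval would be slightly wider. The function name `mean_confidence_interval` and the fish lengths are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (z-based)."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    # Assumption: normal critical value used in place of the t critical value.
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = critical * standard_error
    return centre - margin, centre + margin


# Hypothetical sample of 35 fish lengths in centimetres.
lengths_cm = [20 + (i % 7) for i in range(35)]
low, high = mean_confidence_interval(lengths_cm, confidence=0.95)
print(f"95% CI for the mean length: ({low:.2f}, {high:.2f}) cm")
```

Calling the same function with `confidence=0.99` gives a wider interval around the same sample mean.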

**62. What is the difference between 95% confidence level and 99% confidence level?**

__Ans:__

The confidence interval widens as we move from a 95% confidence level to a 99% confidence level.

**63. Explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.**

__Ans:__

An inlier is a data observation that lies within the rest of the dataset and is unusual or an error. Since it lies in the dataset, it is typically harder to identify than an outlier and requires external data to identify them. Should you identify any inliers, you can simply remove them from the dataset to address them.

**64. How do you handle missing data? What imputation techniques do you recommend?**

__Ans:__

There are several ways to handle missing data:

- Delete rows with missing data
- Mean/Median/Mode imputation
- Assigning a unique value
- Predicting the missing values
- Using an algorithm which supports missing values, like random forests

**65. Can you use Selenium for testing Rest API or Web services?**

__Ans:__

Selenium provides Native APIs for interacting with the browser using actions and events. The Rest API and the web services don’t have any UI and hence can’t be automated using Selenium.

**66. Give an example where the median is a better measure than the mean?**

__Ans:__

The median is a better measure than the mean when there are a number of outliers that positively or negatively skew the data.

**67. Given two fair dice, what is the probability of getting scores that sum to 4? to 8?**

__Ans:__

There are 3 combinations of rolling a 4 (1+3, 3+1, 2+2):

*P(rolling a 4) = 3/36 = 1/12*

There are 5 combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):

*P(rolling an 8) = 5/36*

**68. What is the Central Limit Theorem? Explain it. Why is it important?**

__Ans:__

**Statistics How To provides the best definition of CLT, which is:**

“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.”

The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.

**69. How do you identify whether data is skewed? What are the different types of skewness, and how do you identify them?**

__Ans:__

If the mean and median of numerical data are different, this indicates that the data is skewed. If Mean = Median = Mode, the data is symmetric, as in a normal distribution.

There are two different types of skewness:

**Positive Skewed:-** Most of the data lies on the left side of the distribution and the tail extends towards the right side. Here Mean > Median > Mode.

**Negative Skewed:-** Most of the data lies on the right side of the distribution and the tail extends towards the left side. Here Mean < Median < Mode.

**70. Why is scaling required?**

__Ans:__

Many machine learning algorithms take into account only the magnitude of the measurements, not the units of those measurements.

So a feature that is expressed in a very high magnitude (number) may affect the prediction a lot more than an equally important feature expressed in a smaller magnitude. Scaling brings all features to a comparable range.
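As a small illustration, the sketch below applies min-max scaling, one common scaling technique, to two features measured in very different units. The function name `min_max_scale` and the sample values are assumptions made for the example.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


incomes = [30_000, 45_000, 120_000]  # large magnitudes
ages = [25, 40, 60]                  # small magnitudes

print(min_max_scale(incomes))
print(min_max_scale(ages))
```

After scaling, both features lie between 0 and 1, so neither dominates a distance or gradient calculation simply because of its units.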

**71. Explain Univariate and Bivariate Graph analysis.**

__Ans:__

Univariate Graph | Bivariate Graph |
---|---|
Univariate graph analysis uses only one variable. | Bivariate graph analysis uses two variables to analyze their relationship. |
Plots used in univariate analysis include count plot, distribution plot, histogram, etc. | Bivariate analysis can use both qualitative and quantitative data. Graphs used for bivariate data include scatter plot, bar graph, etc. |

**72. What are the types of modalities?**

__Ans:__

**Unimodal:-** It has only one peak

**Bimodal:-** It has two peaks

**Multimodal:-** It has many peaks

**Uniform:-** All values are distributed uniformly, with no distinct peak

**73. What is the difference between “Long” and “Wide” format data?**

__Ans:__

In wide format, a subject’s repeated responses are recorded in a single row, and each recorded response is in a separate column. In long format, each row is one time point per subject. In wide format, the columns are generally divided into groups, whereas in long format the rows are divided into groups.

**74. Explain Type I and Type II errors.**

__Ans:__

A Type I error occurs when the null hypothesis is true but is rejected. A Type II error occurs when the null hypothesis is false but erroneously fails to be rejected.

**75. Explain the Process of data analysis?**

__Ans:__

**Process of data analysis**

**76. What is: lift, KPI, robustness?**

__Ans:__

**Lift**: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.

**KPI:** stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. E.g., error rate.

**Robustness:** generally robustness refers to a system’s ability to handle variability and remain effective.

**77. Define quality assurance, six sigma.**

__Ans:__

**Quality assurance:** an activity or set of activities focused on maintaining a desired level of quality by minimizing mistakes and defects.

**Six sigma:** a specific type of quality assurance methodology composed of a set of techniques and tools for process improvement. A six sigma process is one in which 99.99966% of all outcomes are free of defects.

**78. Give examples of data that does not have a Gaussian distribution, nor log-normal?**

__Ans:__

- Any type of categorical data won’t have a Gaussian distribution or log-normal distribution.
- Exponential distributions, e.g., the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

**79. What are confounding variables?**

__Ans:__

A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.

**80. What is meant by the statistical power of Sensitivity, and how can we calculate it?**

__Ans:__

The word Sensitivity is often used in validating the accuracy of a classifier (SVM, Random Forest, Logistic Regression, etc.).

In statistical analysis, Sensitivity is the proportion of actual true events that the model also predicted as true. These correctly predicted events are called true positives.

The calculation of sensitivity is pretty straightforward.

Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )
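Here is a minimal sketch of that ratio for a binary classifier, with labels encoded as 1 (event) and 0 (no event). The function name `sensitivity` and the label lists are made up for illustration.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (also called recall)."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_positives / (true_positives + false_negatives)


actual = [1, 1, 1, 1, 0, 0, 0, 1]     # hypothetical ground truth
predicted = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical model output

print(sensitivity(actual, predicted))
```

The false positive at the sixth position does not affect the result, because sensitivity only looks at cases where the event actually occurred.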

**81. What is the meaning of degrees of freedom (DF) in statistics?**

__Ans:__

Degrees of freedom, or DF, is the number of values in a calculation that are free to vary. It is mostly used with the t-distribution and not with the z-distribution.

**82. How can you calculate the p-value using MS Excel?**

__Ans:__

Following steps are performed to calculate the p-value easily:

- Find the Data tab above
- Click on Data Analysis
- Select Descriptive Statistics
- Select the corresponding column
- Input the confidence level

**83. In a scatter diagram, what is the line that is drawn above or below the regression line called?**

__Ans:__

The vertical line drawn from a data point above or below the regression line to the regression line in a scatter diagram represents the residual, also called the prediction error.

**84. What is meant by linear regression?**

__Ans:__

Linear regression is commonly used for conducting predictive analysis. It models the relationship between a dependent variable and one or more independent variables. Let’s say the price of a house depends on two different factors such as location and size. To find the relationship between these factors and the price, we can conduct a linear regression. Linear regression helps us find whether each factor has a positive or negative effect on the price, and how large that effect is.

**85. What is: model fitting, 80/20 rule?**

__Ans:__

**Model fitting:** refers to how well a model fits a set of observations.

**80/20 rule:** also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.

**86. What is Design of experiments?**

__Ans:__

**Design of experiments:** also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).

**87. What do you think of the tail (one tail or two tail) if H0 is equal to one value only?**

__Ans:__

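It is a two-tail test. When H0 states that the parameter equals a single value, the alternative hypothesis is that the parameter differs from that value in either direction, so both tails are used.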

**88. What is the critical value in one tail or two-tail test?**

__Ans:__

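- 1-tail test: the critical value is the cutoff that leaves an area of alpha in one tail
- 2-tail test: the critical values are the cutoffs that leave an area of alpha / 2 in each tail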
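The sketch below shows how a significance level maps to a cutoff, assuming a z-test with a standard normal reference distribution. The helper name `z_critical` is an illustrative choice, not something from this article.

```python
from statistics import NormalDist

def z_critical(alpha: float, two_tailed: bool) -> float:
    """Return the positive z critical value for significance level alpha."""
    # A two-tail test splits alpha equally between both tails
    tail_area = alpha / 2 if two_tailed else alpha
    # inv_cdf returns the z-score with the requested area to its left
    return NormalDist().inv_cdf(1 - tail_area)

one_tail = z_critical(0.05, two_tailed=False)
two_tail = z_critical(0.05, two_tailed=True)
print(f"one-tail: {one_tail:.3f}")  # one-tail: 1.645
print(f"two-tail: {two_tail:.3f}")  # two-tail: 1.960
```

For the same alpha, the two-tail cutoff is further from zero because each tail only receives half of the rejection area.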

**89. What is the proportion of confidence interval that will not contain the population parameter?**

__Ans:__

Alpha is the proportion of confidence intervals that will not contain the population parameter, where CL is the confidence level:

- α = 1 – CL

**90. What is Binary Search?**

__Ans:__

In any binary search, the array has to be arranged in either ascending or descending order. In every step, the algorithm compares the search key with the key of the middle element of the array. If both keys match, a matching element has been found, and its index or position is returned. Otherwise, for an array in ascending order, if the search key is less than the key of the middle element, the algorithm repeats the step on the sub-array to the left of the middle element; if the search key is greater, it repeats the step on the sub-array to the right. The search ends without a match when the sub-array becomes empty.
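The procedure can be sketched as follows, assuming a list sorted in ascending order. The function name and sample data are illustrative.

```python
from typing import Optional, Sequence

def binary_search(items: Sequence[int], key: int) -> Optional[int]:
    """Return the index of key in ascending-sorted items, or None if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == key:
            return mid            # match found
        if key < items[mid]:
            high = mid - 1        # continue in the left sub-array
        else:
            low = mid + 1         # continue in the right sub-array
    return None                   # sub-array is empty: key is not present

data = [3, 8, 15, 21, 34, 55]
print(binary_search(data, 21))  # 3
print(binary_search(data, 4))   # None
```

Each step halves the remaining search range, so the number of comparisons grows with the logarithm of the array length.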

**91. What are the effects of the width of confidence interval?**

__Ans:__

- The confidence interval is used for decision making
- As the confidence level increases, the width of the confidence interval also increases
- A wide CI carries little useful information, because it allows too many plausible values
- A narrow CI carries a higher risk of not containing the population parameter
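To make the trade-off concrete, the sketch below computes the width of a z-based confidence interval for a mean at several confidence levels. The standard deviation of 10 and sample size of 100 are assumed values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(std_dev: float, n: int, confidence: float) -> float:
    """Width of a two-sided z-based confidence interval for a mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # The interval is mean ± z * std_dev / sqrt(n), so its width is twice the margin
    return 2 * z * std_dev / sqrt(n)

widths = {cl: ci_width(std_dev=10.0, n=100, confidence=cl) for cl in (0.90, 0.95, 0.99)}
for cl, width in widths.items():
    print(f"{cl:.0%} CI width: {width:.2f}")
```

With the other inputs held fixed, a higher confidence level always produces a wider interval.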

**92. How to convert normal distribution to standard normal distribution?**

__Ans:__

The standard normal distribution has mean = 0 and standard deviation = 1.

To convert a normal distribution to the standard normal distribution, subtract the mean µ from each value x and divide by the standard deviation σ:

*X (standardized) = (x-µ) / σ*
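A short sketch of the conversion, using assumed values µ = 100 and σ = 15, shows that a probability computed on the original scale matches the one computed on the standardized scale.

```python
from math import isclose
from statistics import NormalDist

def standardize(x: float, mu: float, sigma: float) -> float:
    """Convert x from a normal(mu, sigma) scale to a z-score."""
    return (x - mu) / sigma

mu, sigma = 100.0, 15.0   # assumed parameters for illustration
z = standardize(115.0, mu, sigma)
print(z)  # 1.0

# P(X <= 115) on the original scale equals P(Z <= 1) on the standard scale
p_original = NormalDist(mu, sigma).cdf(115.0)
p_standard = NormalDist().cdf(z)
print(isclose(p_original, p_standard))  # True
```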

**93. What is an Interquartile Range (IQR)?**

__Ans:__

The interquartile range (IQR) is the difference between the third quartile and the first quartile, IQR = Q3 – Q1, so it covers the middle 50% of the observations.

The main advantage of the IQR is that it is not affected by outliers because it doesn’t take into account observations below Q1 or above Q3.

It might still be useful to look for possible outliers in your study.

As a rule of thumb, observations can be qualified as outliers when they lie more than 1.5 IQR below the first quartile or 1.5 IQR above the third quartile. Outliers are values that “lie outside” the other values.

- Outliers: x < Q1 – 1.5 * IQR or x > Q3 + 1.5 * IQR
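The 1.5 IQR rule of thumb can be sketched as below. The data set is made up, and the quartiles come from `statistics.quantiles` with its default method; other tools may use slightly different quartile conventions and give slightly different fences.

```python
from statistics import quantiles

# Hypothetical observations with one suspiciously large value
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3
q1, _median, q3 = quantiles(values, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a possible outlier
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(outliers)  # [95]
```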

**94. What is left skewed distribution and right skewed distribution?**

__Ans:__

**Left skewed**- The left tail is longer than the right side
- Typically, mean < median < mode

**Right skewed**- The right tail is longer than the left side
- Typically, mode < median < mean

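**95. Why do we need a 5-number summary?**

__Ans:__

The 5-number summary gives a quick description of the center, spread, and range of a data set, and it is the basis of the box plot. It consists of:

- Low extreme (minimum)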
- Lower quartile (Q1)
- Median
- Upper quartile (Q3)
- Upper extreme (maximum)

**96. What general conditions must be satisfied for the central limit theorem to hold?**

__Ans:__

- The data must satisfy the randomization condition, which means that it must be sampled randomly.
- The independence assumption dictates that the sample values must be independent of each other.
- The sample size must be large enough. A common rule of thumb is a sample size of 30 or more, because the normal approximation given by the CLT becomes more accurate as the sample size grows.
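A small simulation can illustrate what the theorem claims under these conditions. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1), which is clearly not normal, and draws many independent random samples of size 30. The seed and the number of samples are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 2000

# Each entry is the mean of one random, independent sample from the population
sample_means = [
    mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

theoretical_se = 1 / SAMPLE_SIZE ** 0.5  # population sd / sqrt(n)
print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"sd of sample means:   {stdev(sample_means):.3f}")
print(f"theoretical sd:       {theoretical_se:.3f}")
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size, even though the population itself is skewed.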

**97. How to detect outliers?**

__Ans:__

The best way to detect outliers is through graphical means. Apart from that, outliers can also be detected through the use of statistical methods using tools such as Excel, Python, SAS, among others. The most popular graphical ways to detect outliers include box plot and scatter plot.

**98. How do you calculate the needed sample size?**

__Ans:__

You can rearrange the margin of error (ME) formula, ME = t/z × S / √n, to determine the desired sample size n:

*n = (t/z × S / ME)²*

- t/z = t/z score used to calculate the confidence interval
- ME = the desired margin of error
- S = sample standard deviation
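As a sketch, solving the standard margin of error formula ME = z × S / √n for n gives n = (z × S / ME)². The snippet below assumes a z-score (large sample or known standard deviation). The function name and the input values are illustrative.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(std_dev: float, margin_of_error: float,
                         confidence: float = 0.95) -> int:
    """Smallest n whose z-based margin of error does not exceed the target."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up, because a fractional observation is not possible
    return ceil((z * std_dev / margin_of_error) ** 2)

n = required_sample_size(std_dev=15.0, margin_of_error=2.0)
print(n)  # 217
```

Halving the margin of error roughly quadruples the required sample size, because n depends on the square of the ratio.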

**99. A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200)?**

__Ans:__

Using Excel…

*p =1-norm.dist(1200, 1020, 50, true)*

*p= 0.000159*
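The same calculation can be reproduced outside Excel. A minimal sketch using `NormalDist` from the Python standard library:

```python
from statistics import NormalDist

x_dist = NormalDist(mu=1020, sigma=50)

# P(X > 1200) is the complement of the cumulative probability at 1200
p = 1 - x_dist.cdf(1200)
print(f"{p:.6f}")  # 0.000159
```

The value 1200 lies 3.6 standard deviations above the mean, which is why the upper-tail probability is so small.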


**100. What is the difference between Point Estimates and Confidence Interval?**

__Ans:__

| Point Estimate | Confidence Interval |
|---|---|
| A point estimate gives a single value as the estimate of a population parameter. | A confidence interval gives a range of values that is likely to contain the population parameter. |
| Methods such as the method of moments and maximum likelihood estimation are used to obtain point estimates of a population parameter. | The confidence level quantifies how confident we are that the parameter lies in the interval. |