35+ [REAL-TIME] Data Analyst Interview Questions & Answers
Last updated on 04th Jul 2020, Blog, Interview Questions
A data analyst collects, processes and performs statistical analyses on large dataset. They discover how data can be used to answer questions and solve problems. With the development of computers and an ever increasing move toward technological intertwinement, data analysis has evolved.
1) What is the process of Data Analysis?
Data analysis is the process of collecting, cleansing, interpreting, transforming and modeling data to gather insights and generate reports to gain business profits. Refer to the image below to know the various steps involved in the process.
- Collect Data: The data gets collected from various sources and is stored so that it can be cleaned and prepared. In this step, all the missing values and outliers are removed.
- Analyse Data: Once the data is ready, the next step is to analyze the data. A model is run repeatedly for improvements. Then, the mode is validated to check whether it meets the business requirements.
- Create Reports: Finally, the model is implemented and then reports thus generated are passed onto the stakeholders.
2) What is the difference between Data Mining and Data Profiling?
- Data Mining: Data Mining refers to the analysis of data with respect to finding relations that have not been discovered earlier. It mainly focuses on the detection of unusual records, dependencies and cluster analysis.
- Data Profiling: Data Profiling refers to the process of analyzing individual attributes of data. It mainly focuses on providing valuable information on data attributes such as data type, frequency etc.
3) What is data cleansing and what are the best ways to practice data cleansing?
Data Cleansing or Wrangling or Data Cleaning. All mean the same thing. It is the process of identifying and removing errors to enhance the quality of data. You can refer to the below image to know the various ways to deal with missing data.
4) What are the important steps in the data validation process?
As the name suggests Data Validation is the process of validating data. This step mainly has two processes involved in it. These are Data Screening and Data Verification.
- Data Screening: Different kinds of algorithms are used in this step to screen the entire data to find out any inaccurate values.
- Data Verification: Each and every suspected value is evaluated on various use-cases, and then a final decision is taken on whether the value has to be included in the data or not.
5) What do you think are the criteria to say whether a developed data model is good or not?
Well, the answer to this question may vary from person to person. But below are a few criteria which I think are a must to be considered to decide whether a developed data model is good or not:
- A model developed for the dataset should have predictable performance. This is required to predict the future.
- A model is said to be a good model if it can easily adapt to changes according to business requirements.
- If the data gets changed, the model should be able to scale according to the data.
- The model developed should also be able to easily consumed by the clients for actionable and profitable results.
6) When do you think you should retrain a model? Is it dependent on the data?
Business data keeps changing on a day-to-day basis, but the format doesn’t change. As and when a business operation enters a new market, sees a sudden rise of opposition or sees its own position rising or falling, it is recommended to retrain the model. So, as and when the business dynamics change, it is recommended to retrain the model with the changing behaviors of customers.
7) Can you mention a few problems that data analyst usually encounter while performing the analysis?
The following are a few problems that are usually encountered while performing data analysis.
- Presence of Duplicate entries and spelling mistakes, reduce data quality.
- If you are extracting data from a poor source, then this could be a problem as you would have to spend a lot of time cleaning the data.
- When you extract data from sources, the data may vary in representation. Now, when you combine data from these sources, it may happen that the variation in representation could result in a delay.
- Lastly, if there is incomplete data, then that could be a problem to perform analysis of data.
8) What is the KNN imputation method?
This method is used to impute the missing attribute values which are imputed by the attribute values that are most similar to the attribute whose values are missing. The similarity of the two attributes is determined by using the distance functions.
9) Mention the name of the framework developed by Apache for processing large dataset for an application in a distributed computing environment?
The complete Hadoop Ecosystem was developed for processing large dataset for an application in a distributed computing environment. The Hadoop Ecosystem consists of the following Hadoop components.
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory Data Processing
- PIG, HIVE-> Data Processing Services using Query (SQL-like)
- HBase -> NoSQL Database
- Mahout, Spark MLlib -> Machine Learning
- Apache Drill -> SQL on Hadoop
- Zookeeper -> Managing Cluster
- Oozie -> Job Scheduling
- Flume, Sqoop -> Data Ingesting Services
- Solr & Lucene -> Searching & Indexing
- Ambari -> Provision, Monitor and Maintain cluster
10) Can you tell what is a waterfall chart and when do we use it?
The waterfall chart shows both positive and negative values which lead to the final result value. For example, if you are analyzing a company’s net income, then you can have all the cost values in this chart. With such kind of a chart, you can visually, see how the value from revenue to the net income is obtained when all the costs are deducted.
11) How can you highlight cells with negative values in Excel?
You can highlight cells with negative values in Excel by using the conditional formatting. Below are the steps that you can follow:
- Select the cells which you want to highlight with the negative values.
- Go to the Home tab and click on the Conditional Formatting option
- Go to the Highlight Cell Rules and click on the Less Than option.
- In the dialog box of Less Than, specify the value as 0.
12) How can you clear all the formatting without actually removing the cell contents?
Sometimes you may want to remove all the formatting and just want to have the basic/simple data. To do this, you can use the ‘Clear Formats’ options found in the Home Tab. You can evidently see the option when you click on the ‘Clear’ drop down.
13) What is a Pivot Table, and what are the different sections of a Pivot Table?
A Pivot Table is a simple feature in Microsoft Excel which allows you to quickly summarize huge datasets. It is really easy to use as it requires dragging and dropping rows/columns headers to create reports.
A Pivot table is made up of four different sections:
- Values Area: Values are reported in this area
- Rows Area: The headings which are present on the left of the values.
- Column Area: The headings at the top of the values area makes the columns area.
- Filter Area: This is an optional filter used to drill down in the data set.
14) Can you make a Pivot Table from multiple tables?
Yes, we can create one Pivot Table from multiple different tables when there is a connection between these tables.
15) How can we select all blank cells in Excel?
If you wish to select all the blank cells in Excel, then you can use the Go To Special Dialog Box in Excel. Below are the steps that you can follow to select all the blank cells in Excel.
- First, select the entire dataset and press F5. This will open a Go To Dialog Box.
- Click the ‘Special‘ button which will open a Go To special Dialog box.
- After that, select the Blanks and click on OK.
16) What are the most common questions you should ask a client before creating a dashboard?
Well, the answer to this question varies on a case-to-case basis. But, here are a few common questions that you can ask while creating a dashboard in Excel.
- Purpose of the Dashboards
- Different data sources
- Usage of the Excel Dashboard
- The frequency at which the dashboard needs to be updated
- The version of Office the client uses.
17) What is a Print Area and how can you set it in Excel?
A Print Area in Excel is a range of cells that you designate to print whenever you print that worksheet. For example, if you just want to print the first 20 rows from the entire worksheet, then you can set the first 20 rows as the Print Area.
Now, to set the Print Area in Excel, you can follow the below steps:
- Select the cells for which you want to set the Print Area.
- Then, click on the Page Layout Tab.
- Click on Print Area.
- Click on Set Print Area.
18) What steps can you take to handle slow Excel workbooks?
Well, there are various ways to handle slow Excel workbooks. But, here are a few ways in which you can handle workbooks.
- Try using manual calculation mode.
- Maintain all the referenced data in a single sheet.
- Often use excel tables and named ranges.
- Use Helper columns instead of array formulas.
- Try to avoid using entire rows or columns in references.
- Convert all the unused formulas to values.
19) What is the Alternative Hypothesis?
- To explain the Alternative Hypothesis, you can first explain what the null hypothesis is. Null Hypothesis is a statistical phenomenon that is used to test for possible rejection under the assumption that result of chance would be true.
- After this, you can say that the alternative hypothesis is again a statistical phenomenon which is contrary to the Null Hypothesis. Usually, it is considered that the observations are a result of an effect with some chance of variation.
20) What is the difference between univariate, bivariate and multivariate analysis?
The differences between univariate, bivariate and multivariate analysis are as follows:
- Univariate: A descriptive statistical technique that can be differentiated based on the count of variables involved at a given instance of time.
- Bivariate: This analysis is used to find the difference between two variables at a time.
- Multivariate: The study of more than two variables is nothing but multivariate analysis. This analysis is used to understand the effect of variables on the responses.
21)What are the important responsibilities of a data analyst?
This is the most commonly asked data analyst interview question. You must have a clear idea as to what your job entails. A data analyst is required to perform the following tasks:
- Collect and interpret data from multiple sources and analyze results.
- Filter and “clean” data gathered from multiple sources.
- Offer support to every aspect of data analysis.
- Analyze complex datasets and identify the hidden patterns in them.
- Keep databases secured.
22) What does “Data Cleansing” mean? What are the best ways to practice this?
- If you are sitting for a data analyst job, this is one of the most frequently asked data analyst interview questions.
- Data cleansing primarily refers to the process of detecting and removing errors and inconsistencies from the data to improve data quality.
The best ways to clean data are:
- Segregating data, according to their respective attributes.
- Breaking large chunks of data into small datasets and then cleaning them.
- Analyzing the statistics of each data column.
- Creating a set of utility functions or scripts for dealing with common cleaning tasks.
- Keeping track of all the data cleansing operations to facilitate easy addition or removal from the datasets, if required.
23) Name the best tools used for data analysis.
A question on the most used tool is something you’ll mostly find in any data analytics interview questions.
The most useful tools for data analysis are:
- Google Fusion Tables
- Google Search Operators
24) What is the difference between data profiling and data mining?
Data Profiling focuses on analyzing individual attributes of data, thereby providing valuable information on data attributes such as data type, frequency, length, along with their discrete values and value ranges. On the contrary, data mining aims to identify unusual records, analyze data clusters, and sequence discovery, to name a few.
25) What is KNN imputation method?
KNN imputation method seeks to impute the values of the missing attributes using those attribute values that are nearest to the missing attribute values. The similarity between two attribute values is determined using the distance function.
26) What should a data analyst do with missing or suspected data?
In such a case, a data analyst needs to:
- Use data analysis strategies like deletion method, single imputation methods, and model-based methods to detect missing data.
- Prepare a validation report containing all information about the suspected or missing data.
- Scrutinize the suspicious data to assess their validity.
- Replace all the invalid data (if any) with a proper validation code.
27) Name the different data validation methods used by data analysts.
There are many ways to validate datasets. Some of the most commonly used data validation methods by Data Analysts include:
- Field Level Validation – In this method, data validation is done in each field as and when a user enters the data. It helps to correct the errors as you go.
- Form Level Validation – In this method, the data is validated after the user completes the form and submits it. It checks the entire data entry form at once, validates all the fields in it, and highlights the errors (if any) so that the user can correct it.
- Data Saving Validation – This data validation technique is used during the process of saving an actual file or database record. Usually, it is done when multiple data entry forms must be validated.
- Search Criteria Validation – This validation technique is used to offer the user accurate and related matches for their searched keywords or phrases. The main purpose of this validation method is to ensure that the user’s search queries can return the most relevant results.
28) Define Outlier
A data analyst interview question and answers guide will not complete without this question. An outlier is a term commonly used by data analysts when referring to a value that appears to be far removed and divergent from a set pattern in a sample. There are two kinds of outliers – Univariate and Multivariate.
The two methods used for detecting outliers are:
- Box plot method – According to this method, if the value is higher or lesser than 1.5*IQR (interquartile range), such that it lies above the upper quartile (Q3) or below the lower quartile (Q1), the value is an outlier.
- Standard deviation method – This method states that if a value is higher or lower than mean ± (3*standard deviation), it is an outlier.
29) What is “Clustering?” Name the properties of clustering algorithms.
Clustering is a method in which data is classified into clusters and groups. A clustering algorithm has the following properties:
- Hierarchical or flat
- Hard and soft
30) How can a Data Analyst highlight cells containing negative values in an Excel sheet?
Final question in our data analyst interview questions and answers guide. A Data Analyst can use conditional formatting to highlight the cells having negative values in an Excel sheet. Here are the steps for conditional formatting:
- First, select the cells that have negative values.
- Now, go to the Home tab and choose the Conditional Formatting option.
- Then, go to the Highlight Cell Rules and select the Less Than option.
- In the final step, you must go to the dialog box of the Less Than option and enter “0” as the value.
31) What are the advantages of version control?
The main advantages of version control are –
- It allows you to compare files, identify differences, and consolidate the changes seamlessly.
- It helps to keep track of application builds by identifying which version is under which category – development, testing, QA, and production.
- It maintains a complete history of project files that comes in handy if ever there’s a central server breakdown.
- It is excellent for storing and maintaining multiple versions and variants of code files securely.
- It allows you to see the changes made in the content of different files.
32) Explain the difference between R-Squared and Adjusted R-Squared.
The R-Squared technique is a statistical measure of the proportion of variation in the dependent variables, as explained by the independent variables. The Adjusted R-Squared is essentially a modified version of R-squared, adjusted for the number of predictors in a model. It provides the percentage of variation explained by the specific independent variables that have a direct impact on the dependent variables.
33) Explain univariate, bivariate, and multivariate analysis.
Univariate analysis refers to a descriptive statistical technique that is applied to datasets containing a single variable. The univariate analysis considers the range of values and also the central tendency of the values. Bivariate analysis simultaneously analyzes two variables to explore the possibilities of an empirical relationship between them. It tries to determine if there is an association between the two variables and the strength of the association, or if there are any differences between the variables and what is the importance of these differences. Multivariate analysis is an extension of bivariate analysis. Based on the principles of multivariate statistics, the multivariate analysis observes and analyzes multiple variables (two or more independent variables) simultaneously to predict the value of a dependent variable for the individual subjects.
34) Explain “Normal Distribution.”
One of the popular data analyst interview questions. Normal distribution, better known as the Bell Curve or Gaussian curve, refers to a probability function that describes and measures how the values of a variable are distributed, that is, how they differ in their means and their standard deviations. In the curve, the distribution is symmetric. While most of the observations cluster around the central peak, probabilities for the values steer further away from the mean, tapering off equally in both directions.
35) Differentiate between variance and covariance.
Variance and covariance are both statistical terms. Variance depicts how distant two numbers (quantities) are in relation to the mean value. So, you will only know the magnitude of the relationship between the two quantities (how much the data is spread around the mean). On the contrary, covariance depicts how two random variables will change together. Thus, covariance gives both the direction and magnitude of how two quantities vary with respect to each other.
36) Explain what is the criteria for a good data model?
Criteria for a good data model includes
- It can be easily consumed
- Large data changes in a good model should be scalable
- It should provide predictable performance
- A good model can adapt to changes in requirements
37) Explain what is n-gram?
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n-1).
38) Which imputation method is more favorable?
Although single imputation is widely used, it does not reflect the uncertainty created by missing data at random. So, multiple imputation is more favorable then single imputation in case of data missing at random.
39) Explain what is imputation? List out different types of imputation techniques?
During imputation we replace missing data with substituted values. The types of imputation techniques involve are
- Single Imputation
- Hot-deck imputation: A missing value is imputed from a randomly selected similar record by the help of punch card
- Cold deck imputation: It works same as hot deck imputation, but it is more advanced and selects donors from another datasets
- Mean imputation: It involves replacing missing value with the mean of that variable for all other cases
- Regression imputation: It involves replacing missing value with the predicted values of a variable based on other variables
- Stochastic regression: It is same as regression imputation, but it adds the average regression variance to regression imputation
- Multiple Imputation
- Unlike single imputation, multiple imputation estimates the values multiple times
40) What are hash table collisions? How is it avoided?
- A hash table collision happens when two different keys hash to the same value. Two data cannot be stored in the same slot in array.
- To avoid hash table collision there are many techniques, here we list out two
- Separate Chaining:
- It uses the data structure to store multiple items that hash to the same slot.
- Open addressing:
- It searches for other slots using a second function and store item in first empty slot that is found
41) What is a hash table?
In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which desired value can be fetched.
42) Explain what is correlogram analysis?
A correlogram analysis is the common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for a different spatial relationship. It can be used to construct a correlogram for distance-based data, when the raw data is expressed as distance rather than values at individual points.
43) What is time series analysis?
Time series analysis can be done in two domains, frequency domain and the time domain. In Time series analysis the output of a particular process can be forecast by analyzing the previous data by the help of various methods like exponential smoothening, log-linear regression method, etc.
44) What are some of the statistical methods that are useful for data-analyst?
Statistical methods that are useful for data scientist are
- Bayesian method
- Markov process
- Spatial and cluster processes
- Rank statistics, percentile, outliers detection
- Imputation techniques, etc.
- Simplex algorithm
- Mathematical optimization
45) Explain what is Clustering? What are the properties for clustering algorithms?
Clustering is a classification method that is applied to data. Clustering algorithm divides a data set into natural groups or clusters.
Properties for clustering algorithm are
- Hierarchical or flat
- Hard and soft
46) Explain what is Map Reduce?
Map-reduce is a framework to process large data sets, splitting them into subsets, processing each subset on a different server and then blending results obtained on each.
47) Explain what is KPI, design of experiments and 80/20 rule?
- KPI: It stands for Key Performance Indicator, it is a metric that consists of any combination of spreadsheets, reports or charts about business process
- Design of experiments: It is the initial process used to split your data, sample and set up of a data for statistical analysis
- 80/20 rules: It means that 80 percent of your income comes from 20 percent of your clients
48) Explain what are the tools used in Big Data?
Tools used in Big Data includes
49) Explain what is collaborative filtering?
- Collaborative filtering is a simple algorithm to create a recommendation system based on user behavioral data. The most important components of collaborative filtering are users- items- interest.
- A good example of collaborative filtering is when you see a statement like “recommended for you” on online shopping sites that’s pops out based on your browsing history.
50) Mention what are the key skills required for Data Analyst?
A data scientist must have the following skills
- Database knowledge
- Database management
- Data blending
- Data manipulation
- Predictive Analytics
- Basic descriptive statistics
- Predictive modeling
- Advanced analytics
- Big Data Knowledge
- Big data analytics
- Unstructured data analysis
- Machine learning
- Presentation skill
- Data visualization
- Insight presentation
- Report design
51) Explain what is K-mean Algorithm?
K mean is a famous partitioning method. Objects are classified as belonging to one of K groups, k chosen a priori.
In K-mean algorithm,
- The clusters are spherical: the data points in a cluster are centered around that cluster
- The variance/spread of the clusters is similar: Each data point belongs to the closest cluster
52) Explain what is Hierarchical Clustering Algorithm?
Hierarchical clustering algorithm combines and divides existing groups, creating a hierarchical structure that showcase the order in which groups are divided or merged.
53) Mention how to deal the multi-source problems?
To deal the multi-source problems,
- Restructuring of schemas to accomplish a schema integration
- Identify similar records and merge them into single record containing all relevant attributes without redundancy
54) What do you know about interquartile range as data analyst?
A measure of the dispersion of data that is shown in a box plot is referred to as the interquartile range. It is the difference between the upper and the lower quartile
55) What is the criteria to say whether a developed data model is good or not?
- The developed model should have predictable performance.
- A good data model can adapt easily to any changes in business requirements.
- Any major data changes in a good data model should be scalable.
- A good data model is one that can be easily consumed for actionable results.
56) How will you create a classification to identify key customer trends in unstructured data?
A model does not hold any value if it cannot produce actionable results, an experienced data analyst will have a varying strategy based on the type of data being analysed. For example, if a customer complain was retweeted then should that data be included or not. Also, any sensitive data of the customer needs to be protected, so it is also advisable to consult with the stakeholder to ensure that you are following all the compliance regulations of the organization and disclosure laws, if any.
57) Mention some common problems that data analysts encounter during analysis.
- Having a poor formatted data file. For instance, having CSV data with un-escaped newlines and commas in columns.
- Having inconsistent and incomplete data can be frustrating.
- Common Misspelling and Duplicate entries are a common data quality problem that most of the data analysts face.
- Having different value representations and misclassified data.
58) How will you handle the QA process when developing a predictive model to forecast customer churn?
Data analysts require inputs from the business owners and a collaborative environment to operationalize analytics. To create and deploy predictive models in production there should be an effective, efficient and repeatable process. Without taking feedback from the business owner, the model will just be a one-and-done model.
59) List out few best practices that are followed when it comes to data cleansing?
The best practices that are followed when it comes to data cleansing is as follows:
1. Need to define and follow a standard verification process of evaluating the data even before getting it into the database.
2. Identification and handling duplicate values so that data accuracy is always maintained
3. There is a need to develop a quality data plan which actually focuses on identifying any possible errors and also learn from the mistakes and constantly improving the plan.
60) What is data screening in data validation process?
The data screening is a process where the entire set of data is actually processed by using various algorithms to see whether we have any questionable data. This type of values is handled externally and thoroughly examined.
61) Explain few important aspects of Data analysis?
The data analysis is nothing but an in-depth study the entire data set that is available in the database.
1. First and foremost step for data analysis starts with questions and assumptions.
2. It also involves in identifying troublesome records which need to be cleaned.
3. Convey the same information to the stakeholders so that they can understand the outcome of the data analysis.
4. Study based on different regression models will help them to state an expected output.
62) Explain what does clustering mean?
The clustering is defined as a process of grouping a definite set of objects based on certain predefined parameters. This is one of the value-added data analysis technique that is used industry-wide while processing a large set of data.