35+ BEST Data Science [ Google ] Interview Questions & Answers
Last updated on 04th Jul 2020, Blog, Interview Questions
These Google Data Science interview questions are designed to acquaint you with the kind of questions you may encounter in a Google Data Science interview. In my experience, good interviewers rarely plan to ask any particular question; they usually start with a basic concept and continue based on the discussion and your answers. We cover the top Google Data Science interview questions along with detailed answers, including scenario-based questions, questions for freshers, and questions for experienced candidates.
Q1. How are missing values and impossible values represented in R?
One of the main issues when working with real data is handling missing values. In R, these are represented by NA. Impossible values (0/0, for example) are represented by NaN (not a number).
Q2. How do you explain Random Forest to a non-technical person?
Random Forest is a classification algorithm. Its main purpose is to predict the outcome for a given observation. An important defining characteristic of a random forest is that it is simply a collection of decision trees. There are many terms involved, but the concept is rather simple and can be illustrated with an example: instead of asking one friend whether you will like a film, you ask many friends, each of whom asks you a different set of questions, and you go with the majority vote.
Q3. What’s wrong with training and testing a machine learning model on the same data?
This is one of the more common data scientist interview questions. When we train a model, we expose it to the training data, and it learns the patterns in that data. By the end of training, it becomes very good at predicting this particular dataset. However, we may overfit: accuracy keeps improving not because the model is good, but because it has learned every little detail of the data it was given. It doesn’t look for the general patterns, but for the noise in the data provided. When that happens, the model performs disastrously on new data in a real-life setting.
Q4. How to make sure you are not overfitting while training a model?
- Regularization – in the context of machine learning, regularization is the process of modifying a learning algorithm to make the model simpler, often to prevent overfitting or to solve an ill-posed problem.
- Early stopping – early stopping is a common form of regularization designed precisely to prevent overfitting. It consists of techniques that interrupt the training process once the model starts overfitting.
Here you may be expected to say ‘validation’ or ‘cross-validation’. Early stopping methods typically use the validation results to decide whether to stop the training process.
- Feature selection – for some models, useless input features lead to much worse performance. Choose only the most relevant features for your problem; otherwise the extra features can contribute to overfitting, among other issues.
- Ensembles – methods that combine several base models in order to produce one optimal predictive model. A good example of the ensemble method is Random Forest (a collection of decision trees).
Q5. What is cross-validation? How to do it right?
Cross-validation refers to a family of model validation techniques that use the same dataset for both training and validation. Usually, this is done on a rotational basis: the data is split into parts, and each part takes a turn as the validation set while the model is trained on the rest, so that the observations used for validation have not been seen during training. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
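As a rough illustration of the rotation described above, here is a minimal k-fold splitter in plain Python. The function name `k_fold_indices` and its parameters are hypothetical, chosen for this sketch; in practice you would usually shuffle the data first and rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Every observation lands in the validation set exactly once.
for train, validation in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "validation:", validation)
```

Each model is trained on the `train` indices and scored on the `validation` indices, and the scores are then averaged across folds.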
Q6. How do you create a table in R without using external files?
To create a table from scratch, you can use any of R’s random generator functions to draw random numbers from a distribution and store them in a matrix or a data frame. Examples of such functions include rnorm(), runif(), and rbinom().
Q7. Explain the significance of Transpose in R?
Transpose is one of the simplest ways you can reshape a data structure in R. If you transpose a data frame or a matrix, you will essentially be rotating the data, so rows become columns, and vice versa.
Q8. Why would you use a Null as a data value?
It’s important not to confuse a NULL value with the value 0 or with a “NONE” response. Instead, think of a null value as a missing value. 0 or “NONE” could be values assigned by the user, while “NULL” is a value assigned by the computer if the user has provided no value for a given record.
Q9. What is a primary key and a foreign key?
- A primary key is a column (or a set of columns) whose value exists and is unique for every record in a table. It’s important to know that each table can have one and only one primary key.
- A foreign key, instead, is a column (or a set of columns) that references a column (most often the primary key) of another table. Foreign keys can be called identifiers, too, but they identify the relationships between tables, not the tables themselves.
Q10. Given a table with duplicate data, how would you extract only specific rows based on business requirements provided?
In most cases, the tools from the Data Manipulation Language (DML) will allow you to do that. Usually, you can either use a SELECT DISTINCT statement to select distinct rows only, or apply a GROUP BY clause to a join to filter the data in the desired way.
Q11. A box has 12 red cards and 12 black cards. Another box has 24 red cards and 24 black cards. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of getting cards of the same color and why?
The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step. Let’s say the first card you draw from each box is red. This means that in the box with 12 reds and 12 blacks, there are now 11 reds and 12 blacks. Therefore, your probability of drawing another red is 11/(11+12), or 11/23. In the box with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore, your probability of drawing another red is 23/(23+24), or 23/47. Since 23/47 > 11/23, the second box, which has more cards, has a higher probability of yielding two cards of the same color.
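The comparison above can be checked with exact fractions. This is a small sketch; the helper name `p_same_color` is made up for illustration.

```python
from fractions import Fraction


def p_same_color(reds, blacks):
    """Probability that two cards drawn without replacement share a color."""
    total = reds + blacks
    p_two_reds = Fraction(reds, total) * Fraction(reds - 1, total - 1)
    p_two_blacks = Fraction(blacks, total) * Fraction(blacks - 1, total - 1)
    return p_two_reds + p_two_blacks


small_box = p_same_color(12, 12)
large_box = p_same_color(24, 24)
print(small_box, float(small_box))
print(large_box, float(large_box))
```

Summing over both colors gives the same 11/23 and 23/47 as conditioning on the first card, because the boxes are symmetric in red and black.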
Q12. You are at a casino and have two dice to play with. You win $10 every time you roll a 5. If you play till you win and then stop, what is the expected payout?
- Let’s assume that it costs $5 every time you want to play.
- There are 36 possible combinations with two dice.
- Of the 36 combinations, 4 result in a total of five: (1,4), (2,3), (3,2), and (4,1). This means that there is a 4/36, or 1/9, chance of rolling a 5.
- A 1/9 chance of winning means that, on average, you play nine times: you lose eight times and win once.
- Therefore, your expected payout is equal to $10.00 * 1 – $5.00 * 9 = -$35.00.
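The steps above can be reproduced by enumerating the dice rolls. This sketch keeps the $5 cost per play, which is the answer’s own assumption, and uses the fact that the expected number of plays until the first win is 1/p.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two dice, and those that total five.
rolls = list(product(range(1, 7), repeat=2))
winning = [roll for roll in rolls if sum(roll) == 5]
p_win = Fraction(len(winning), len(rolls))

COST_PER_PLAY = 5   # assumption taken from the answer above
PRIZE = 10

# Mean of a geometric distribution: expected plays until the first win.
expected_plays = 1 / p_win
expected_net = PRIZE - COST_PER_PLAY * expected_plays

print(winning)
print(p_win, expected_plays, expected_net)
```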
Q13. How can you tell if a given coin is biased?
This isn’t a trick question. The answer is simply to perform a hypothesis test:
- The null hypothesis is that the coin is not biased, so the probability of flipping heads is 50% (p = 0.5). The alternative hypothesis is that the coin is biased and p != 0.5.
- Flip the coin 500 times.
- Calculate the Z-score from the observed proportion of heads (for a small sample, say fewer than 30 flips, the normal approximation is unreliable and an exact binomial test is the safer choice).
- Find the corresponding p-value and compare it against alpha. For a two-tailed test at alpha = 0.05, either compare the two-tailed p-value against 0.05 or compare each tail against 0.05/2 = 0.025.
- If p-value > alpha, the null is not rejected: there is no evidence that the coin is biased.
- If p-value < alpha, the null is rejected and the coin is considered biased.
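A minimal sketch of the Z-test described above, using only the standard library. The function name `coin_bias_z_test` and the example counts (280 heads in 500 flips) are invented for illustration; the p-value returned is the two-sided one, so it is compared directly against alpha = 0.05.

```python
import math


def coin_bias_z_test(heads, flips, p0=0.5):
    """Return (z, two_sided_p_value) for H0: P(heads) == p0 (normal approximation)."""
    p_hat = heads / flips
    standard_error = math.sqrt(p0 * (1 - p0) / flips)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value P(|Z| >= |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = coin_bias_z_test(heads=280, flips=500)
alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0 (coin looks biased)" if p_value < alpha else "fail to reject H0")
```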
Q14. How to Make an unfair coin fair?
- Since a coin flip is a binary outcome, you can make an unfair coin fair by flipping it twice. If you flip it twice, there are two outcomes that you can bet on: heads followed by tails, or tails followed by heads.
- These two outcomes are equally likely: P(heads) * P(tails) = P(tails) * P(heads).
- This holds because each coin toss is an independent event. If you get heads → heads or tails → tails, you discard the result and flip the coin twice again.
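The procedure above is often attributed to von Neumann. Below is a small simulation sketch; the 80% heads bias, the seed, and the function names are arbitrary choices for illustration.

```python
import random


def biased_flip(p_heads, rng):
    """One flip of an unfair coin that lands heads with probability p_heads."""
    return "H" if rng.random() < p_heads else "T"


def fair_flip(p_heads, rng):
    """Flip twice; keep the first result only when the two flips differ."""
    while True:
        first = biased_flip(p_heads, rng)
        second = biased_flip(p_heads, rng)
        if first != second:
            return first  # "HT" counts as heads, "TH" as tails


rng = random.Random(0)
results = [fair_flip(0.8, rng) for _ in range(10_000)]
print("share of heads:", results.count("H") / len(results))
```

The more biased the coin, the more pairs are discarded, so the trick trades extra flips for fairness.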
Q15. You are about to get on a plane to London, you want to know whether you have to bring an umbrella or not. You call three of your random friends and ask each one of them if it’s raining. The probability that your friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell that it is raining, then what is the probability that it is actually raining in London?
You can tell that this question is related to Bayesian theory because of the last statement, which essentially follows the structure “What is the probability that A is true given that B is true?” Therefore, we need to know the probability of it raining in London on a given day. Let’s assume it’s 25%.
- P(A) = probability of it raining = 25%
- P(B) = probability that all 3 friends say that it’s raining
- P(A|B) = probability that it’s raining given that all 3 friends say it is raining
- P(B|A) = probability that all 3 friends say that it’s raining given that it’s raining = (2/3)³ = 8/27
- Step 1: Solve for P(B)
Bayes’ theorem states that P(A|B) = P(B|A) * P(A) / P(B), where, by the law of total probability,
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27
- Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / ( 0.25*8/27 + 0.75*1/27)
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
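The same calculation can be written as a short function with exact fractions. The 25% prior is the assumption made in the answer above; the function name is hypothetical.

```python
from fractions import Fraction


def p_rain_given_all_say_rain(prior_rain, p_truth, n_friends=3):
    """Posterior P(rain | every friend says it is raining), via Bayes' theorem."""
    p_lie = 1 - p_truth
    likelihood_if_rain = p_truth ** n_friends   # all friends tell the truth
    likelihood_if_dry = p_lie ** n_friends      # all friends lie
    evidence = likelihood_if_rain * prior_rain + likelihood_if_dry * (1 - prior_rain)
    return likelihood_if_rain * prior_rain / evidence


posterior = p_rain_given_all_say_rain(Fraction(1, 4), Fraction(2, 3))
print(posterior, float(posterior))
```

Changing the prior changes the answer, which is why the assumed 25% should be stated explicitly in an interview.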
Q16. You are given 40 cards with four different colors- 10 Green cards, 10 Red Cards, 10 Blue cards, and 10 Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color?
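Read the question as asking for the probability that the second card matches the first in neither number nor color. Since these events are not independent, we can use the multiplication rule:
P(A and B) = P(A) * P(B|A), which applied to the complements gives
P(not A and not B) = P(not A) * P(not B | not A)
Suppose, for example, that the first card drawn is the yellow 4. Of the 39 remaining cards, 36 are not a 4, and 27 of those 36 are not yellow:
P(not 4 and not yellow) = P(not 4) * P(not yellow | not 4)
P(not 4 and not yellow) = (36/39) * (27/36)
P(not 4 and not yellow) = 0.692
Therefore, the probability that the two cards share neither number nor color is 69.2%.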
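As a sanity check, the result can be confirmed by brute force. This sketch enumerates every pair of cards and counts those that differ in both number and color, which is the reading of the question used in the calculation above.

```python
from fractions import Fraction
from itertools import combinations

colors = ["green", "red", "blue", "yellow"]
deck = [(color, number) for color in colors for number in range(1, 11)]

# All unordered pairs of distinct cards.
pairs = list(combinations(deck, 2))
differ_in_both = [
    (a, b) for a, b in pairs
    if a[0] != b[0] and a[1] != b[1]   # different color and different number
]

probability = Fraction(len(differ_in_both), len(pairs))
print(len(differ_in_both), len(pairs), probability, round(float(probability), 3))
```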
Q17. Can you enumerate the various differences between Supervised and Unsupervised Learning?
Supervised learning is a type of machine learning where a function is inferred from labeled training data. The training data contains a set of training examples.
Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from datasets containing input data without labeled responses. Following are the various other differences between the two types of machine learning:
- Algorithms Used – Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm, Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks.
- Enables – Supervised learning enables classification and regression, whereas unsupervised learning enables clustering, dimension reduction, and density estimation
- Use – While supervised learning is used for prediction, unsupervised learning finds use in analysis
Q18. What do you understand by the Selection Bias? What are its various types?
Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect. Following are the various types of selection bias:
- Sampling Bias – A systematic error caused by a non-random sample of a population, in which some members are less likely to be included than others, resulting in a biased sample.
- Time Interval – A trial might be ended at an extreme value, usually due to ethical reasons, but the extreme value is most likely to be reached by the variable with the most variance, even though all variables have a similar mean.
- Data – Results when specific data subsets are selected for supporting a conclusion or rejection of bad data arbitrarily.
- Attrition – Caused due to attrition, i.e. loss of participants, discounting trial subjects or tests that didn’t run to completion.
Q19. Please explain the goal of A/B Testing?
A/B Testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B Testing is to identify changes to a webpage that maximize the likelihood of an outcome of interest. A highly reliable method for finding the best online marketing and promotional strategies for a business, A/B Testing can be used to test everything from sales emails to search ads and website copy.
Q20. How will you calculate the Sensitivity of machine learning models?
- In machine learning, Sensitivity is used to evaluate a classifier, such as Logistic Regression, Random Forest, or SVM. It is also known as REC (recall) or TPR (true positive rate).
- Sensitivity is the ratio of correctly predicted positive events to all actual positive events, i.e.:
- Sensitivity = True Positives / Positives in Actual Dependent Variable = True Positives / (True Positives + False Negatives)
- Here, true positives are the actual positive events that the machine learning model also predicted as positive. The best sensitivity is 1.0 and the worst sensitivity is 0.0.
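The formula above can be computed directly from the actual and predicted labels. This is a minimal sketch; the label lists are made-up data and `sensitivity` is an illustrative helper, not a library function.

```python
def sensitivity(actual, predicted, positive=1):
    """True positive rate: TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(1 for a, p in pairs if a == positive and p == positive)
    false_negatives = sum(1 for a, p in pairs if a == positive and p != positive)
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined when there are no actual positives")
    return true_positives / actual_positives


actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
print(sensitivity(actual, predicted))  # 3 true positives out of 5 actual positives
```

Note that false positives (predicted 1, actually 0) do not affect sensitivity; they are captured by precision and specificity instead.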
Q21. Could you draw a comparison between overfitting and underfitting?
To make reliable predictions on unseen data in machine learning and statistics, a model must be fit to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.
Following are the various differences between overfitting and underfitting:
- Definition – A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.
- Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. An example of a complex model is one with too many parameters compared to the total number of observations. Underfitting occurs, for example, when trying to fit a linear model to non-linear data.
- Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.
Q22. Between Python and R, which one would you pick for text analytics and why?
For text analytics, Python has the upper hand over R for these reasons:
- The Pandas library in Python offers easy-to-use data structures as well as high-performance data analysis tools
- Python is generally faster for most types of text analytics
- R is a better fit for machine learning than for text analysis
Q23. Please explain the role of data cleaning in data analysis.
Data cleaning can be a daunting task because, as the number of data sources increases, the time required to clean the data increases sharply, owing to the vast volume of data the additional sources generate. Data cleaning alone can take up to 80% of the total time required for a data analysis task.
Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:
- Cleaning data from different sources helps in transforming the data into a format that is easy to work with
- Data cleaning increases the accuracy of a machine learning model
Q24. What do you mean by cluster sampling and systematic sampling?
Cluster sampling is used when it is difficult to study a target population spread over a wide area and simple random sampling becomes ineffective. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling, in contrast, selects elements from an ordered sampling frame at a regular interval, for example every 10th record after a random starting point.
Q25. Please explain Eigenvectors and Eigenvalues?
Eigenvectors help in understanding linear transformations. In data analysis, they are typically calculated for a correlation or covariance matrix. Eigenvalues can be understood as the strength of the transformation in the direction of the corresponding eigenvector, or as the factor by which compression or stretching occurs.
Q26. Can you compare the validation set with the test set?
A validation set is part of the training set used for parameter selection as well as for avoiding overfitting of the machine learning model being developed. On the contrary, a test set is meant for evaluating or testing the performance of a trained machine learning model.
Q27. What do you understand by linear regression and logistic regression?
Linear regression is a statistical technique in which the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable. Logistic regression, also known as the logit model, is a statistical technique for predicting a binary outcome from a linear combination of predictor variables.
Q28. Please explain Recommender Systems along with an application.
Recommender Systems are a subclass of information filtering systems, meant for predicting the preference or rating a user would give to a product.
An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.
Q29. What are outlier values and how do you treat them?
Outlier values, or simply outliers, are data points that differ markedly from the rest of the population. An outlier value is an abnormal observation that is very different from the other values in the set. Outliers can be identified using univariate analysis or a graphical analysis method. A few outliers can be assessed individually, but a large set of outliers is typically handled by substituting them with the 99th or the 1st percentile values.
There are two popular ways of treating outlier values:
- To change the value so that it can be brought within a range
- To simply remove the value
Q30. Please enumerate the various steps involved in an analytics project?
Following are the steps involved in an analytics project:
- Understanding the business problem
- Exploring the data and becoming familiar with it
- Preparing the data for modeling by means of detecting outlier values, transforming variables, treating missing values, et cetera
- Running the model and analyzing the result for making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is gained)
- Validating the model using a new dataset
- Implementing the model and tracking the results to analyze its performance
Q31. Could you explain how to define the number of clusters in a clustering algorithm?
The primary objective of clustering is to group similar entities together so that entities within a group are similar to each other while the groups remain different from one another. Generally, the Within Sum of Squares (WSS) is used to explain the homogeneity within a cluster. To define the number of clusters in a clustering algorithm, WSS is plotted against a range of cluster counts. The resulting graph is known as the Elbow Curve. The Elbow Curve contains a point after which WSS no longer decreases appreciably. This is known as the bending point and represents K in K-Means. Although this is the most widely used approach, another important approach is hierarchical clustering, in which dendrograms are created first and distinct groups are then identified from them.
Q32. What do you understand by Deep Learning?
Deep Learning is a paradigm of machine learning that is loosely analogous to the functioning of the human brain. It is based on neural networks with many layers, such as convolutional neural networks (CNNs). Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Although Deep Learning has been around for a long time, it has gained worldwide acclaim only recently. This is mainly due to:
- An increase in the amount of data generated by various sources
- The growth in hardware resources required for running Deep Learning models
Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, PyTorch, and TensorFlow are some of the most popular Deep Learning frameworks as of today.
Q33. Please explain Gradient Descent?
The gradient is the degree of change in the output of a function relative to changes in its inputs. In training, it measures how the error changes with respect to a change in the weights. A gradient can also be understood as the slope of a function. Gradient Descent refers to descending to the bottom of a valley, as opposed to climbing up a hill. It is a minimization algorithm meant for minimizing a given cost function.
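To make the valley picture concrete, here is a minimal one-dimensional sketch. The function f(x) = (x - 3)^2, the learning rate, and the step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

A learning rate that is too large would overshoot the minimum and diverge, while one that is too small would need many more steps.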
Q34. How does Backpropagation work? Also, state its various variants.
Backpropagation is a training algorithm used for multilayer neural networks. It propagates the error from the output end of the network back to all the weights inside the network, which allows for efficient computation of the gradient.
Backpropagation works in the following way:
- Forward propagation of the training data
- The output and the target are used to compute the error and its derivatives
- Backpropagate to compute the derivative of the error with respect to the output activation
- Use the previously calculated derivatives to compute the derivatives for the earlier layers
- Update the weights
Following are the gradient descent variants used with backpropagation:
- Batch Gradient Descent – The gradient is calculated over the complete dataset, and one update is performed per iteration
- Mini-batch Gradient Descent – Small batches of samples are used to calculate the gradient and update the parameters (a compromise between Batch and Stochastic Gradient Descent)
- Stochastic Gradient Descent – Only a single training example is used to calculate the gradient and update the parameters
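The three variants listed above differ only in how many samples feed each update. The sketch below shows this with a single hypothetical function, `fit_line`, that fits y = w * x + b on a tiny noise-free dataset; the data, learning rate, and epoch count are assumptions made for illustration.

```python
import random


def fit_line(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

print("batch:     ", fit_line(xs, ys, batch_size=len(xs)))  # one update per epoch
print("mini-batch:", fit_line(xs, ys, batch_size=2))
print("stochastic:", fit_line(xs, ys, batch_size=1))        # one update per sample
```

With `batch_size=len(xs)` there is one update per pass over the data; with `batch_size=1` there is one update per sample, which is noisier on real data but far more frequent.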
Q35. What do you know about Autoencoders?
Autoencoders are simple learning networks that transform inputs into outputs with the minimum possible error, which means that the outputs are very close to the inputs. A few layers are added between the input and the output, with each of these layers smaller than the input layer. An autoencoder receives unlabeled input, which is encoded and then decoded to reconstruct the input.
Q36. What motivates you about this position?
By asking this question, the recruiter wants to understand whether you are excited about the new opportunity that lies ahead of you. Your enthusiasm, of course, is highly correlated with the amount of effort you will put in once the job is offered.
A motivated person would try to be proactive and create a positive working environment, which is precisely what every company needs. The real question isn’t whether you should say that you are motivated. Of course, you should. You need to think of a way that would best show that you are genuinely interested in the position under consideration. There are a lot of different things that can motivate you:
- The learning opportunities that you will have on the job
- Future growth prospects
- You like the team that you will be joining (if you have met them)
- You share the company’s values/mission
- The company operates in a dynamic, ever-changing industry
- The company’s prestige
Q37. Give me an example of a time when you had to go the extra mile?
“The only way to do great work is to love what you do” (Steve Jobs)
Going the extra mile is rarely a one-time act. More often, it is an ingrained habit. You need to properly explain to your recruiter that you love the idea of working that job. Also, explain how you want to be excellent at it. Your internal drive towards excellence is what motivates you to go the extra mile – to do the things that you are not expected to do:
- Studying during the weekends
- Staying late in the office
- Striving for excellence constantly
If the job you are interviewing for is what you chose for your life, then you want to be excellent at it. Striving to achieve excellent performance is important. It means that you want to put quality in your work and create value for the company. Internal drive is probably the best reason to go the extra mile; you are willing to do what is necessary in order to be good at what you do.
Q38. What is Dropout in Data Science?
Dropout is a technique in Data Science that randomly drops hidden and visible units of a network during training. It prevents overfitting by dropping a fraction of the nodes, as much as 20%, on each training iteration, so that the network cannot rely too heavily on any single unit.
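A minimal sketch of the idea on a plain list of activations. It uses the common "inverted dropout" variant, which rescales the surviving units so the expected activation is unchanged; that rescaling is an assumption of this sketch, not something stated above, and the function name is illustrative.

```python
import random


def dropout(activations, drop_rate=0.2, training=True, rng=random):
    """Zero each unit with probability drop_rate and rescale the survivors."""
    if not training or drop_rate == 0:
        return list(activations)  # dropout is disabled at inference time
    keep_prob = 1.0 - drop_rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
print(dropout([0.5, 1.0, 1.5, 2.0, 2.5], drop_rate=0.2, rng=rng))
```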
Q39. What is Batch normalization in Data Science?
Batch Normalization in Data Science is a technique for improving the performance and stability of a neural network. It normalizes the inputs to each layer so that the mean output activation stays close to 0 and the standard deviation close to 1.
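The normalization step can be sketched for a single feature across one batch. The `gamma` and `beta` parameters stand in for the learnable scale and shift used in practice, and `epsilon` guards against division by zero; the names and default values are illustrative assumptions.

```python
import math


def batch_normalize(values, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
print(batch_normalize(batch))
```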
Q40. What is the difference between Batch and Stochastic Gradient Descent?
The difference between Batch and Stochastic Gradient Descent is summarized below:
|Batch Gradient Descent|Stochastic Gradient Descent|
|It computes the gradient using the complete dataset.|It computes the gradient using a single sample.|
|Each update is slow to compute, so it takes time to converge.|Each update is cheap to compute, so it usually takes less time to converge.|
|A large volume of data is processed for every update.|A small volume of data is processed for every update.|
|It updates the weights infrequently.|It updates the weights more frequently.|
Q41. What are Auto-Encoders?
Auto-Encoders are learning networks that transform inputs into outputs with the lowest possible error, keeping the output as close to the input as possible. They work by adding layers between the input and the output, and these layers are kept smaller than the input for faster processing.
Q42. What are the various Machine Learning Libraries and their benefits?
The various machine learning libraries and their benefits are as follows.
- Numpy: It is used for scientific computation.
- Statsmodels: It is used for time-series analysis.
- Pandas: It is used for tabular data analysis.
- Scikit-learn: It is used for data modeling and pre-processing.
- Tensorflow: It is used for the deep learning process.
- Regular Expressions: It is used for text processing.
- Pytorch: It is used for the deep learning process.
- NLTK: It is used for text processing.
Q43. What is an Activation function?
An activation function introduces non-linearity into a neural network, which enables the network to learn complex functions. Without an activation function, the network could represent only linear functions, that is, linear combinations of its inputs. The activation function is applied by each artificial neuron to produce its output from its inputs.
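A tiny sketch of why the non-linearity matters. The two-unit "network" below uses hand-picked weights (+1 and -1), which are purely illustrative: without an activation it collapses to a linear function, while with ReLU it computes |x|, which no linear function can represent.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def two_unit_network(x, activation=None):
    """One input, two hidden units with weights +1 and -1, summed at the output."""
    hidden = [1.0 * x, -1.0 * x]
    if activation is not None:
        hidden = [activation(h) for h in hidden]
    return hidden[0] + hidden[1]


for x in (-3.0, -1.0, 0.0, 2.0):
    print(x, two_unit_network(x), two_unit_network(x, relu))
```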
Q44. What are the different types of Deep Learning Frameworks?
The different types of Deep Learning frameworks include the following:
- Caffe
- Chainer
- Keras
- Microsoft Cognitive Toolkit
- PyTorch
- TensorFlow
Q45. What are vanishing gradients?
Vanishing gradients occur when the gradient (slope) becomes too small during the training of an RNN. The result is poor performance, low accuracy, and long training times.
Q46. What are exploding gradients?
Exploding gradients occur when the error gradients grow at an exponential or otherwise very high rate during the training of an RNN. The accumulated gradients result in very large updates to the network weights, which can cause an overflow and produce NaN values.
Q47. What is the full form of LSTM? What is its function?
LSTM stands for Long Short-Term Memory. It is a type of recurrent neural network that is capable of learning long-term dependencies and recalling information for long periods as part of its default behavior.
Q48. What are the different steps in LSTM?
The different steps in LSTM include the following.
- Step 1: The network decides what needs to be remembered and what needs to be forgotten.
- Step 2: It selects the cell state values that should be updated.
- Step 3: The network decides what becomes part of the current output.
Q49. What is Pooling in CNN?
Pooling is a method used to reduce the spatial dimensions of a CNN. It performs downsampling operations to reduce dimensionality and create pooled feature maps. Pooling works by sliding a filter matrix over the input matrix.
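As an illustration of the sliding window, here is a minimal max-pooling sketch over a nested list. Max pooling is one common choice of pooling operation (average pooling is another); the 4x4 feature map is made-up data.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep the maximum of each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
print(max_pool_2d(feature_map))  # the 4x4 map is downsampled to 2x2
```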
Q50. What is RNN?
RNN stands for Recurrent Neural Network. RNNs are a type of artificial neural network designed for sequential data, such as stock market prices, time series, and various others. To understand how RNNs work, it helps to first understand the basics of feedforward nets.
Q51. What are the different layers in a CNN?
There are four different layers in a CNN. These include the following.
- Convolutional Layer: In this layer, several small windows (filters) are created to go over the data.
- ReLU Layer: This layer helps in bringing non-linearity to the network and converts the negative pixels to zero so that the output becomes a rectified feature map.
- Pooling Layer: This layer reduces the dimensionality of the feature map.
- Fully Connected Layer: This layer recognizes and classifies the objects in the image.
Q52. Imagine you’re in a room with 3 light switches. In the next room, there are 3 light bulbs, each controlled by one of the switches. You have to find out which switch controls each bulb by checking the room just once. Keep in mind that all lights are initially off, and you can’t see into 1 room from the other. So, how can you figure out which switch is connected to which light bulb?
Let’s say we have switches 1, 2, and 3. What you can do is leave switch 1 off, turn switch 2 on for 5 minutes, and then turn it off. Then turn switch 3 on and leave it like that. Then you enter the room. Obviously, switch 3 controls the light bulb you left on. The bulb that is off but still warm, is controlled by switch 2. And switch one controls the light bulb you never turned on.
Q53. How many square feet of pizza are eaten in the United States each month?
Let’s say there are roughly 300 million people in America, out of which 200 million eat pizza. Now, suppose the average pizza-eater has pizza twice a month and eats two slices at a time. That makes four slices per month. The usual slice of pizza is about six inches at the base and 10 inches long, so, treating it as a triangle, the slice is 30 square inches of pizza. Consequently, four slices of pizza amount to 120 square inches. Since one square foot equals 144 square inches, we can say that each pizza-eater consumes roughly one square foot per month. And, as there are 200 million pizza-eaters in America, we can conclude that roughly 200 million square feet of pizza are consumed in the US each month.
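The estimate above rounds 120 square inches up to one square foot. The sketch below redoes the arithmetic without that rounding, using the same assumed inputs, to show how much the rounding matters.

```python
PIZZA_EATERS = 200_000_000          # assumed in the answer above
SLICES_PER_MONTH = 2 * 2            # twice a month, two slices each time
SLICE_AREA_SQ_IN = 0.5 * 6 * 10     # triangle: half of base times length
SQ_IN_PER_SQ_FT = 144

sq_ft_per_person = SLICES_PER_MONTH * SLICE_AREA_SQ_IN / SQ_IN_PER_SQ_FT
total_sq_ft = PIZZA_EATERS * sq_ft_per_person

print(f"{sq_ft_per_person:.2f} sq ft per person per month")
print(f"{total_sq_ft / 1e6:.0f} million sq ft per month")
```

In a market-sizing question the stated assumptions and the order of magnitude matter more than the exact figure, so rounding to 200 million is acceptable as long as you say that you are rounding.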
Q54. Please explain the concept of a Boltzmann Machine?
A Boltzmann Machine features a simple learning algorithm that enables it to discover interesting features representing complex regularities present in the training data. It is mainly used to optimize the weights and quantities for a given problem.
Q55. What are the skills required as a Data Scientist that could help in using Python for data analysis purposes?
The skills required as a Data Scientist that could help in using Python for data analysis purposes are stated under:
- Expertise in Pandas DataFrames, Scikit-learn, and N-dimensional NumPy arrays.
- Skills to apply element-wise vector and matrix operations on NumPy arrays.
- Understanding of built-in data types, including tuples, sets, dictionaries, and various others.
- Familiarity with the Anaconda distribution and the Conda package manager.
- Capability in writing efficient list comprehensions and small, clean functions, and in avoiding traditional for loops.
- Knowledge of Python scripting and of how to optimize bottlenecks.
Q56. What is the full form of GAN? Explain GAN?
The full form of GAN is Generative Adversarial Network. It feeds a noise vector to the Generator, which produces fake samples, and then passes those samples to the Discriminator, whose task is to tell the real inputs from the fake ones.
Q57. What are the vital components of GAN?
There are two vital components of GAN. These include the following:
- Generator: The Generator acts as a forger, which creates fake copies.
- Discriminator: The Discriminator acts as a recognizer that tells fake copies from real ones.
Q58. What is the Computational Graph?
A computational graph is the graph representation of a computation on which TensorFlow is based. It is a network of nodes in which each node represents a particular mathematical operation, and the edges between the nodes carry tensors. This flow of tensors through the graph is where the name TensorFlow comes from. Because the data flows in the form of a graph, it is also called the DataFlow Graph.
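To make the idea of nodes as operations concrete, here is a toy graph in plain Python. It is not TensorFlow's API; the `Node` class and `constant` helper are hypothetical names invented for this sketch, which simply evaluates (a + b) * c by walking the graph.

```python
import operator


class Node:
    """A graph node: an operation plus the nodes that feed values into it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = inputs

    def evaluate(self):
        # Evaluate the input nodes first, then apply this node's operation.
        return self.op(*(node.evaluate() for node in self.inputs))


def constant(value):
    """A leaf node that has no inputs and just produces a value."""
    return Node(lambda: value)


a, b, c = constant(2.0), constant(3.0), constant(4.0)
total = Node(operator.add, (a, b))        # a + b
result = Node(operator.mul, (total, c))   # (a + b) * c

print(result.evaluate())
```

Building the graph and evaluating it are separate steps here, which mirrors the general idea of describing a computation first and running it afterwards.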
Q59. What are tensors?
Tensors are mathematical objects that generalize scalars, vectors, and matrices to higher dimensions. Data of different kinds, such as text or numbers, is encoded as tensors of a given rank (number of dimensions) before it is fed as input to the neural network.
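One way to picture rank is with nested Python lists. The `shape` helper below is a hypothetical function written for this sketch (real libraries expose this information directly on their array types), and it assumes the nested lists are rectangular.

```python
def shape(tensor):
    """Return the size of each dimension of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)


scalar = 7.0                            # rank 0
vector = [1.0, 2.0, 3.0]                # rank 1
matrix = [[1, 2, 3], [4, 5, 6]]         # rank 2
batch = [[[0, 0], [0, 0], [0, 0]]]      # rank 3

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("batch", batch)]:
    dims = shape(value)
    print(name, "rank", len(dims), "shape", dims)
```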
Q60. Why is TensorFlow considered a high priority in learning Data Science?
TensorFlow is considered a high priority in learning Data Science because it provides support for programming languages such as C++ and Python. It also supports computing devices including the CPU and GPU, which speeds up the input, processing, and analysis of data. This helps various data science processes complete within the stipulated time frame. It is often described as faster than libraries such as Keras and Torch, although in practice that depends on the workload.
Q61. What is an Epoch in Data Science?
An epoch in Data Science represents one complete pass over the entire training dataset, in which every training example is presented to the learning model once.
Q62. What is a Batch in Data Science?
A batch is a subset of the dataset. The dataset is divided into batches that are passed into the system one at a time. Batches are used when the developer cannot pass the entire dataset into the neural network at once.
Q63. What is the iteration in Data Science? Give an example?
An iteration in Data Science is the processing of one batch, so an epoch consists of as many iterations as there are batches. For example, when there are 50,000 images and the batch size is 100, one epoch will run 50,000 / 100 = 500 iterations.
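The relationship between dataset size, batch size, and iterations can be written as a one-line helper. This is a small sketch; `iterations_per_epoch` is our own name, and rounding up assumes the last, smaller batch is still processed.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(50_000, 100))  # the example from the answer
print(iterations_per_epoch(50_050, 100))  # a partial final batch adds one
```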
Q64. What is the cost function?
A cost function is a tool to evaluate how well the model has performed. It measures the errors (losses) made at the output layer. During the backpropagation process, these errors are moved backward through the neural network and used to adjust it.
Q65. What are hyperparameters?
A hyperparameter is a kind of parameter whose value is set before the learning process begins. Hyperparameters define the training requirements and the structure of the network. Examples include the number of hidden units, the learning rate, the number of epochs, and various others.
Q66. Which skills are important to become a certified Data Scientist?
The important skills to become a certified Data Scientist include the following:
- Knowledge of built-in data types including lists, tuples, sets and related.
- Expertise in N-dimensional NumPy arrays.
- Ability to apply Pandas DataFrames.
- Strong command of element-wise vector operations.
- Knowledge of matrix operations on NumPy arrays.
Q67. What is an Artificial Neural Network in Data Science?
An Artificial Neural Network in Data Science is a specific set of algorithms, inspired by biological neural networks, that adapts to changes in the input so that the best output can be achieved. It helps in generating the best possible results without the need to redesign the output methods.
Q68. What is Deep Learning in Data Science?
Deep Learning in Data Science is the name given to a branch of machine learning whose models, multi-layered neural networks, are loosely modeled on the functioning of the human brain. In this way, it is a paradigm of machine learning.
Q69. Are there differences between Deep Learning and Machine Learning?
Yes, there are differences between Deep Learning and Machine Learning. These are stated below:
|Deep Learning||Machine Learning|
|It is concerned with algorithms called Artificial Neural Networks, which are inspired by the structure and functions of the human brain.||It gives computers the ability to learn without being explicitly programmed. It includes supervised, unsupervised, and reinforcement machine learning processes.|
|It is a subcomponent of machine learning.||It includes Deep Learning as one of its components.|
Q70. What is Ensemble learning?
Ensemble learning is the process of combining a diverse set of learners (individual models) with each other. It helps in improving the stability and predictive power of the model.
Q71. What are the different kinds of Ensemble learning?
The different kinds of Ensemble learning include the following:
- Bagging: It trains similar learners on small samples of the population and takes the mean of their predictions for estimation purposes.
- Boosting: It trains learners in sequence and adjusts the weight of each observation based on the previous classification, so that later learners concentrate on the observations that were misclassified.
Q72. What are the differences between supervised and unsupervised learning?
|Supervised Learning||Unsupervised Learning|
|Uses known and labeled data as input||Uses unlabeled data as input|
|Has a feedback mechanism||Has no feedback mechanism|
|Most commonly used algorithms are decision trees, logistic regression, and support vector machines||Most commonly used algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm|
Q73. How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
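The role of the sigmoid can be sketched in a few lines of Python. The weights, bias, and feature values below are made up for illustration and are not fitted to any data; in practice they would be learned from the training set.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Linear combination of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model with two features.
weights = [0.8, -0.5]
bias = -0.2
features = [1.5, 2.0]

probability = predict_probability(features, weights, bias)
label = 1 if probability >= 0.5 else 0   # 0.5 is a common default threshold
print(round(probability, 3), label)
```

The output of the linear part can be any real number; the sigmoid is what turns it into something that can be read as a probability.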
Q74. Explain the steps in making a decision tree?
- Take the entire data set as input
- Calculate entropy of the target variable, as well as the predictor attributes
- Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
- Choose the attribute with the highest information gain as the root node
- Repeat the same procedure on every branch until the decision node of each branch is finalized
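The entropy and information-gain steps above can be sketched as follows. The tiny `rows` dataset and its column names are invented for the example, and only the choice of the root attribute is shown, not the full recursive tree construction.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Hypothetical data set: should we play outside?
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]

gains = {attr: information_gain(rows, attr, "play")
         for attr in ("outlook", "windy")}
root = max(gains, key=gains.get)   # attribute with the highest gain
print(gains, "root:", root)
```

Repeating the same calculation on the rows inside each branch gives the next level of the tree.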
Q75. How do you build a random forest model?
A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.
Steps to build a random forest model:
- Randomly select ‘k’ features from a total of ‘m’ features where k << m
- Among the ‘k’ features, calculate the node D using the best split point
- Split the node into daughter nodes using the best split
- Repeat steps two and three until leaf nodes are finalized
- Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees
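The steps above can be sketched in plain Python under some loud simplifications: each "tree" is a single-split decision stump, each tree is trained on a bootstrap sample of the rows, Gini impurity is assumed as the split criterion, and the toy data is invented. A real project would normally use a library implementation instead.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_stump(X, y, feature_ids):
    """Best (feature, threshold) split among the candidate features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, threshold, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority class
        m = majority(y)
        return (0, float("inf"), m, m)
    return best[1:]


def build_forest(X, y, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]     # bootstrap sample of rows
        features = rng.sample(range(len(X[0])), k)   # k of the m features
        forest.append(best_stump([X[i] for i in idx],
                                 [y[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Each stump votes; the forest returns the most common vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical data: three features, two classes.
X = [[1, 10, 0.1], [2, 11, 0.2], [3, 12, 0.3],
     [8, 20, 0.7], [9, 21, 0.8], [10, 22, 0.9]]
y = ["low", "low", "low", "high", "high", "high"]

forest = build_forest(X, y, n_trees=25, k=2)
print(predict(forest, [1, 10, 0.1]), predict(forest, [10, 22, 0.9]))
```

With only three features, k = 2 is not much smaller than m; on real data with many features, k is usually a small fraction of m.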
Q76. How can you avoid overfitting your model?
Overfitting refers to a model that is fitted too closely to a very small amount of data and ignores the bigger picture. There are three main methods to avoid overfitting:
- Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques, such as k-fold cross-validation
- Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
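To show what k-fold cross-validation does to the data, here is a small sketch that only produces the index splits. `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, so the averaged score reflects performance on data the model did not see while training on that fold.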
Q77. Differentiate between univariate, bivariate, and multivariate analysis?
- Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.
- Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
- Multivariate data involves three or more variables. It is similar to bivariate analysis but contains more than one dependent variable.
Q78. What are the feature selection methods used to select the right variables?
There are two main methods for feature selection:
- Filter Methods
- Wrapper Methods
Q79. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
The following are ways to handle missing data values:
- If the data set is large, we can just simply remove the rows with missing data values. It is the quickest way; we use the rest of the data to predict the values.
- For smaller data sets, we can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python. There are different ways to do so, such as computing the column means with df.mean() and passing them to df.fillna().
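Both strategies can be sketched without any library. The helper names and the sample values below are made up, and missing values are represented as `None`; with pandas, `dropna()` and `fillna()` on a DataFrame play the same roles.

```python
from statistics import mean


def drop_incomplete_rows(rows):
    """Strategy 1: keep only the rows that have no missing value."""
    return [row for row in rows if None not in row]


def impute_with_mean(values):
    """Strategy 2: replace each missing value with the mean of the rest."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical data: (age, income) rows and a single column of ages.
rows = [[25, 40_000], [None, 52_000], [31, None], [40, 61_000]]
ages = [25, None, 31, 40, None]

complete_rows = drop_incomplete_rows(rows)
imputed_ages = impute_with_mean(ages)
print(complete_rows)
print(imputed_ages)
```

Dropping rows is simple but discards data, while mean imputation keeps every row at the cost of understating the column's variance.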
Q80. For the given points, how will you calculate the Euclidean distance in Python?
plot1 = [1,3]
plot2 = [2,5]
The Euclidean distance can be calculated as follows:
from math import sqrt
euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )
Q81. What are dimensionality reduction and its benefits?
Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely. This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
Q82. How should you maintain a deployed model?
The steps to maintain a deployed model are:
- Monitor: Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.
- Evaluate: Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.
- Compare: The new models are compared to each other to determine which model performs the best.
- Rebuild: The best performing model is re-built on the current state of data.
Q83. What are recommender systems?
A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:
- Collaborative filtering
- Content-based filtering
Q84. How do you find RMSE and MSE in a linear regression model?
RMSE and MSE are two of the most common measures of accuracy for a linear regression model.
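- RMSE indicates the Root Mean Square Error. It is the square root of the MSE.
- MSE indicates the Mean Square Error. It is the average of the squared differences between the predicted and the actual values.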
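Both measures are short to compute by hand. In this sketch the `actual` and `predicted` values are invented; in practice `predicted` would come from the fitted regression model.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print("MSE:", mse(actual, predicted))
print("RMSE:", round(rmse(actual, predicted), 4))
```

RMSE is often easier to interpret because it is expressed in the units of the target variable, while MSE penalizes large errors in squared units.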
Q85. How can you select k for k-means?
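The elbow method is used to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of ‘k’, where ‘k’ is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each value. WSS is defined as the sum of the squared distance between each member of the cluster and its centroid. When WSS is plotted against k, the ‘elbow’ where the curve stops dropping sharply suggests a suitable k.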
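The WSS calculation itself can be sketched as below. To keep the example short, no k-means algorithm is run: the cluster assignments for k = 1 and k = 2 are picked by hand on four made-up points, which is enough to show how sharply WSS drops when k matches the structure of the data.

```python
def wss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        centroid = [sum(coord) / len(points) for coord in zip(*points)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(point, centroid))
            for point in points
        )
    return total


# Hypothetical 2-D points forming two obvious groups.
group_a = [(1, 1), (1, 2)]
group_b = [(8, 8), (9, 8)]

wss_k1 = wss([group_a + group_b])   # everything in one cluster
wss_k2 = wss([group_a, group_b])    # one cluster per group
print("k=1:", wss_k1, "k=2:", wss_k2)
```

In a full elbow analysis this value would be computed from the clusters k-means finds for each k and then plotted against k.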
Q86. What is the significance of p-value?
- A p-value ≤ 0.05 (the typical cutoff): This indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
- A p-value > 0.05: This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
- A p-value close to the 0.05 cutoff: This is considered to be marginal, meaning it could go either way.
Q87. How can time-series data be declared as stationary?
It is stationary when the variance and mean of the series are constant with time.
Q88. How can you calculate accuracy using a confusion matrix?
The formula for accuracy is:
Accuracy = (True Positive + True Negative) / Total Observations
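For example, with 262 true positives and 347 true negatives out of 650 total observations:
Accuracy = (262 + 347) / 650
= 609 / 650
≈ 0.937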
Q89. Write a basic SQL query that lists all orders with customer information?
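Assume an Order table and a Customer table that contain the columns used in the query. Because ORDER is a reserved word in SQL, the table name is quoted.
The SQL query is:
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM "Order"
JOIN Customer
ON "Order".CustomerId = Customer.Id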
Q90. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn’t you be happy with your model performance? What can you do about it?
Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as a measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient’s prognosis. Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.
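A small sketch shows how a 96 percent accuracy can hide poor detection. The confusion-matrix counts are invented for the example (1,000 patients, 40 of whom have cancer), and the function name is ours.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy plus class-wise metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical results: 40 sick patients, only 10 of them detected.
metrics = classification_metrics(tp=10, fp=10, tn=950, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here the accuracy is 96 percent, yet the sensitivity is only 25 percent: three out of four cancer patients are missed, which is exactly what the accuracy figure fails to show.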
Q91. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
- K-means clustering
- Linear regression
- K-NN (k-nearest neighbor)
- Decision trees
Q92. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?
The most appropriate algorithm for this case is logistic regression.
Q93. After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?
Q94. Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?
Q95. What are the feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.
Q96. What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
Q97. What is logistic regression?
Logistic regression is also known as the logit model. It is a technique used to forecast the binary outcome from a linear combination of predictor variables.
Q98. What is collaborative filtering?
Most recommender systems use this filtering process to find patterns and information by collaborating perspectives, numerous data sources, and several agents.
Q99. What are the confounding variables?
These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. When they are left out, the estimate fails to account for the confounding factor.