# SAS Predictive Modeling Interview Questions and Answers [ FRESHERS ]

Last updated on 14th Nov 2021, Blog, Interview Questions

If you’re looking for SAS Predictive Modeling interview questions, for experienced candidates or freshers, you are at the right place. Many reputed companies around the world offer opportunities in this field. According to research, SAS Predictive Modeling has a market share of about 0.8%, so you still have the opportunity to move ahead in your career as a SAS Predictive Modeling Analyst. ACTE offers Advanced SAS Predictive Modeling Interview Questions 2021 to help you crack your interview and land your dream career as a SAS Predictive Modeler.

**1.What is Predictive Modelling? **

**Ans:**

Predictive modeling uses historical data and statistical techniques to estimate the likelihood of future outcomes. It is one of the most sought-after skills today and is used in almost every domain, from finance and retail to manufacturing. It is seen as a method for solving complex business problems and helps businesses grow, e.g. through a predictive acquisition model or an optimization engine that solves a network problem.

**2. What are the essential steps in a predictive modeling project? **

**Ans:**

A predictive modeling project consists of the following steps:

- Establish the business objective of the predictive model.
- Pull historical data – internal and external.
- Select the observation and performance windows.
- Create newly derived variables.
- Split the data into training, validation and test samples.
- Clean the data – treatment of missing values and outliers.
- Variable reduction / selection.
- Variable transformation.
- Develop the model.
- Validate the model.
- Check model performance.
- Deploy the model.
- Monitor the model.
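The step of splitting data into training, validation and test samples can be sketched in a few lines. This is a minimal Python illustration rather than SAS code; the function name `split_rows`, the 60/20/20 proportions and the seed are assumptions chosen for the example, not values from the article.

```python
import random

def split_rows(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and cut them into training, validation and test samples."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains after the first two cuts
    return train, valid, test

train, valid, test = split_rows(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before cutting keeps each sample representative as long as the rows are independent; time-ordered data would usually be split by date instead.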

**3.Explain the problem statement of your project. **

**Ans:**

A problem statement is usually one or two sentences to explain the problem your process improvement project will address. In general, a problem statement will outline the negative points of the current situation and explain why this matters.

**4.Difference between Linear and Logistic Regression? **

**Ans:**

The two main differences are as follows:

- Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups), while binary logistic regression requires the dependent variable to be binary – two categories only (0/1). Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.
- Linear regression is based on least squares estimation, which says regression coefficients should be chosen so that they minimize the sum of the squared distances of each observed response to its fitted value. Logistic regression is based on maximum likelihood estimation, which says coefficients should be chosen so that they maximize the probability of Y given X (the likelihood).
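To make the two estimation criteria concrete, the sketch below evaluates both objective functions for a model with one predictor. It is a Python illustration with hypothetical function names (`sse`, `log_likelihood`) and toy data; it only scores candidate coefficients and does not perform the actual fitting.

```python
import math

def sse(xs, ys, b0, b1):
    """Least squares criterion: sum of squared distances to the fitted line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def log_likelihood(xs, ys, b0, b1):
    """Maximum likelihood criterion for binary y: log P(Y | X) under a logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))   # predicted P(Y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Continuous response: smaller SSE means a better linear fit.
print(sse([0, 1, 2, 3], [1, 3, 5, 7], b0=1, b1=2))
# Binary response: larger log-likelihood means a better logistic fit.
print(log_likelihood([-2, -1, 1, 2], [0, 0, 1, 1], b0=0, b1=1))
```

Linear regression searches for the coefficients with the smallest `sse`, while logistic regression searches for those with the largest `log_likelihood`.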

**5.How to treat outliers? **

**Ans:**

There are several methods to treat outliers:

- Percentile capping.
- Box-plot method.
- Mean plus or minus 3 standard deviations.
- Weight of Evidence.
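Three of the methods listed for this question can be sketched briefly. The Python example below is illustrative only: the helper names, the nearest-rank percentile definition, the 1st/99th percentile cut-offs and the 1.5 × IQR multiplier are all assumptions, and statistical packages may use slightly different percentile definitions.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def cap_by_percentile(values, lower=1, upper=99):
    """Percentile capping: pull values outside the chosen percentiles back to them."""
    lo, hi = percentile(values, lower), percentile(values, upper)
    return [min(max(v, lo), hi) for v in values]

def box_plot_fences(values, k=1.5):
    """Box-plot method: points outside Q1 - k*IQR and Q3 + k*IQR are outliers."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def three_sigma_bounds(values):
    """Mean plus or minus 3 standard deviations."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return mu - 3 * sd, mu + 3 * sd

data = list(range(1, 101)) + [10_000]      # one extreme value
capped = cap_by_percentile(data)
print(max(data), max(capped))
print(box_plot_fences(data))
```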

**6.What is multicollinearity and how to deal with it? **

**Ans:**

Multicollinearity implies high correlation between independent variables. The absence of multicollinearity is one of the assumptions in linear and logistic regression. It can be identified by looking at the VIF score of variables. VIF > 2.5 implies a moderate collinearity issue, and VIF > 5 is considered high collinearity. It can be handled by an iterative process: first, remove the variable with the highest VIF, then check the VIF of the remaining variables. If the VIF of any remaining variable is > 2.5, repeat the first step until all VIFs are <= 2.5.

**7.Explain co-linearity between continuous and categorical variables? **

**Ans:**

Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, which means that changing the reference category can reduce collinearity. Pick the reference category with the highest proportion of cases.

**8.What are the applications of predictive modeling? **

**Ans:**

Predictive modeling is mostly used in the following areas:

- Acquisition – Cross Sell / Up Sell.
- Retention – Predictive Attrition Model.
- Customer Lifetime Value Model.
- Next Best Offer.
- Market Mix Model.
- Pricing Model.
- Campaign Response Model.
- Probability of customers defaulting on a loan.
- Segmenting customers based on their homogeneous attributes.
- Demand Forecasting.
- Usage Simulation.
- Underwriting.
- Optimization – Optimize Network.

**9.Is VIF a correct method to compute co-linearity in this case? **

**Ans:**

VIF is not a correct method in this case, i.e. for collinearity between a continuous variable and a categorical (dummy) variable. VIFs should only be run for continuous variables. A t-test can be used to check collinearity between a continuous and a dummy variable.

**10.Difference between Factor Analysis and PCA? **

**Ans:**

The three main differences between these two techniques are as follows:

- In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
- Principal Components Analysis is used as a variable reduction technique, whereas Factor Analysis is used to understand what constructs underlie the data.
- In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.

**11.What are the effective measures in a Predictive Modeling project? **

**Ans:**

It comprises the following steps:

- Set up the business target.
- Examine historical data, both internal and external.
- Select the observation and performance windows.
- Create newly derived variables.
- Split the data into training, validation, and test samples.
- Clean data – treatment of missing values and outliers.
- Variable reduction/selection.
- Variable transformation.
- Create model.
- Validate model.
- Check model performance.

**12.Contrast between Linear and Logistic Regression? **

**Ans:**

The two primary contrasts are as follows:

- In linear regression, the dependent variable needs to be continuous, i.e. it takes numeric values rather than categories. In binary logistic regression, the dependent variable must be binary, i.e. either 0 or 1, although it can have more than two categories in multinomial or ordinal logistic regression.
- Linear regression relies on least squares estimation: the coefficients are chosen to minimize the sum of the squared distances between observed and fitted values. Logistic regression relies on maximum likelihood estimation: the coefficients are chosen to maximize the probability of Y given X.

**13.What do you mean by SAP security? **

**Ans:**

SAP security means giving business users the right access according to their authority or responsibility, and granting permissions according to their roles.

**14.Clarify what is “roles” in SAP security? **

**Ans:**

“Roles” refers to a group of t-codes that is assigned to execute a particular business task. Each role in SAP requires particular privileges to execute a function in SAP; these privileges are called AUTHORIZATIONS.

**15.What prerequisites should be met before assigning SAP_ALL to a user, even with approval from the authority? **

**Ans:**

The prerequisites are:

- Enabling the audit log using the SM19 t-code.
- Retrieving the audit log using the SM20 t-code.

**16.Clarify what is SOD in SAP Security? **

**Ans:**

SOD means Segregation of Duties; it is implemented in SAP in order to detect and prevent error or fraud during business transactions. For example, if a user or employee has the privilege to access bank account details and the payment run, they might be able to divert vendor payments to their own account.

**17.What are role templates used for? **

**Ans:**

Role templates comprise transactions, web addresses, and reports. They are predefined activity groups in SAP.

**18.Is it possible to change the role template? How? **

**Ans:**

Yes, we can change a user role template. There are exactly three ways in which we can work with user role templates:

- We can use them as they are delivered in SAP.
- We can modify them according to our requirements through PFCG.
- We can create them from scratch.

For all of the above, we need to use the PFCG transaction to maintain them.

**19.What is the user type for a background jobs user? **

**Ans:**

- System User.
- Communication User.

**20.Clarify Important Model Performance Statistics? **

**Ans:**

- AUC > 0.7, with no significant difference between the AUC score of training versus validation.
- KS should fall in the top 3 deciles and should be more than 30.
- Rank ordering: no break in rank ordering.
- Same signs of parameter estimates in both training and validation.

**21.What Is P-value And How It Is Used For Variable Selection? **

**Ans:**

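The p-value is the lowest level of significance at which you can reject the null hypothesis. In the case of independent variables, it indicates whether the coefficient of a variable is significantly different from zero. For variable selection, variables whose p-value exceeds the chosen significance level (commonly 0.05) are candidates for removal.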

**22.Clarify The Problem Statement Of Your Project. What Are The Financial Impacts Of It? **

**Ans:**

Cover the objective or main goal of your predictive model. Compare the monetary benefits of the predictive model versus no model. Also highlight non-monetary benefits (if any).

**23.State the difference between derived role and single role. **

**Ans:**

T-codes can be added to or deleted from a single role, whereas a derived role does not allow that.

**24.Clarify what is authorization object class and authorization object? **

**Ans:**

- Authorization Object Class – Authorization objects fall under authorization object classes, which are grouped by functional area such as HR, finance, accounting, etc.
- Authorization Object – Authorization objects are groups of authorization fields that regulate a particular activity. An authorization relates to a particular action, while the authorization field lets security administrators configure specific values for that action.

**25.Clarify what is PFCG_Time_Dependency? **

**Ans:**

PFCG_TIME_DEPENDENCY is a report that is used for user master comparison. It also clears expired profiles from the user master record. The PFUD transaction code can also be used to execute this report directly.

**26.Mention the two tables authorization objects need in order to be maintained? **

**Ans:**

- USOBT
- USOBX

**27.How can you lock all the users simultaneously in SAP? **

**Ans:**

All the users in SAP can be locked simultaneously by running EWZ5 t-code.

**28.How To Handle Missing Values? **

**Ans:**

We fill in (impute) missing values using the following methods, or treat missing values as a separate category:

- Mean imputation for continuous variables (no outliers).
- Median imputation for continuous variables (if outliers are present).
- Cluster imputation for continuous variables.
- Imputation with a random value drawn between the minimum and maximum of the variable [Random value = min(x) + (max(x) – min(x)) * ranuni(SEED)].
- Imputing continuous variables with zero (requires business knowledge).
- Conditional mean imputation for continuous variables.
- Other imputation methods for continuous variables – predictive mean matching, Bayesian linear regression, linear regression ignoring model error, and so on.
- WOE for missing values in categorical variables.
- Decision tree, random forest, or logistic regression for categorical variables.
- Decision tree and random forest work for both continuous and categorical variables.
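The simplest of these strategies (mean, median, and a random value between the minimum and maximum) can be sketched as follows. This is a Python illustration, not SAS; the function name `impute`, the use of `None` for a missing value and the default seed are assumptions made for the example.

```python
import random
import statistics

def impute(values, strategy="mean", seed=0):
    """Return a copy of values with None entries filled by the chosen strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = lambda: statistics.mean(observed)
    elif strategy == "median":
        fill = lambda: statistics.median(observed)
    elif strategy == "random":
        rng = random.Random(seed)
        lo, hi = min(observed), max(observed)
        # Same idea as min(x) + (max(x) - min(x)) * ranuni(SEED)
        fill = lambda: lo + (hi - lo) * rng.random()
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill() if v is None else v for v in values]

income = [1, 2, None, 3, 100]            # 100 is an outlier
print(impute(income, "mean"))            # the outlier pulls the mean upward
print(impute(income, "median"))          # the median is less affected
```

The example shows why the median is preferred when outliers are present: a single extreme value shifts the mean far more than the median.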

**29.How Is VIF Calculated And How Is It Interpreted? **

**Ans:**

VIF measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity. If the VIF of a predictor variable were 9 (√9 = 3), the standard error for the coefficient of that predictor variable would be 3 times as large as it would be if that predictor were uncorrelated with the other predictor variables.

Steps for calculating VIF:

- Run a linear regression in which one of the independent variables is treated as the target variable and all the other independent variables are treated as independent variables.
- Calculate the VIF of the variable: VIF = 1/(1 – R-squared).
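The arithmetic behind the formula and the VIF = 9 example can be checked with a small sketch. It is written in Python for illustration; the function names are hypothetical, and the R-squared value is assumed to come from the auxiliary regression described in the steps.

```python
import math

def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others."""
    if not 0 <= r_squared < 1:
        raise ValueError("R-squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def standard_error_inflation(vif):
    """Factor by which the coefficient's standard error grows: sqrt(VIF)."""
    return math.sqrt(vif)

vif = vif_from_r_squared(8 / 9)          # an R-squared of about 0.889
print(round(vif, 6), round(standard_error_inflation(vif), 6))
```

An R-squared of 8/9 in the auxiliary regression gives a VIF of 9, so the standard error is about 3 times larger than it would be without collinearity.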

**30.Do We Remove Intercepts While Calculating VIF? **

**Ans:**

No. VIF depends on the intercept because there is an intercept in the regression used to determine VIF. If the intercept is removed, R-squared is not meaningful because it can be negative, in which case you would get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors.

**31.Explain Collinearity Between Continuous And Categorical Variables. Is VIF A Correct Method To Compute Collinearity In This Case? **

**Ans:**

Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, which means that changing the reference category of dummy variables can reduce collinearity. Pick the reference category with the highest proportion of cases. VIF is not a correct method in this case, because VIFs should only be run for continuous variables; a t-test can be used to check collinearity between a continuous and a dummy variable.

**32.List down the reasons for choosing SAS over other data analytics tools. **

**Ans:**

We will compare SAS with the popular alternatives in the market based on the following aspects:

**33.What is SAS? **

**Ans:**

SAS stands for Statistical Analysis System. It is a software suite for advanced analytics, multivariate analyses, business intelligence, data management and predictive analytics, developed by SAS Institute. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS language.

**34.What are the features of SAS? **

**Ans:**

The following are the features of SAS:

- Business Solutions: SAS provides business analysis that various companies can use as business products.
- Analytics: SAS is the market leader in the analytics of various business products and services.
- Data Access & Management: SAS can also be used as DBMS software.
- Reporting & Graphics: SAS helps to visualize the analysis in the form of summaries, lists and graphic reports.
- Visualization: We can visualize the reports in the form of graphs ranging from simple scatter plots and bar charts to complex multi-page classification panels.

**35.Mention few capabilities of SAS Framework. **

**Ans:**

The following are the four capabilities in the SAS Framework:

- Access: SAS allows us to access data from multiple sources such as an Excel file, raw database, Oracle database and SAS data sets.
- Manage: We can then manage this data to subset data, create variables, and validate and clean data.
- Analyze: Further, analysis happens on this data. We can perform simple analyses like frequency and averages and complex analyses including regression and forecasting. SAS is the gold standard for statistical analyses.
- Present: Finally, we can present our analysis in the form of list, summary and graphic reports. We can either print these reports, write them to a data file or publish them online.

**36.What is the function of output statement in a SAS Program? **

**Ans:**

You can use the OUTPUT statement to save summary statistics in a SAS data set. This information can then be used to create customized reports or to save historical information about a process. With the OUTPUT statement you can:

- Specify the statistics to save in the output data set,
- Specify the name of the output data set, and
- Compute and save percentiles not automatically computed by the CAPABILITY procedure.

In a DATA step, the OUTPUT statement writes the current observation to the output data set.

**37.What is the function of Stop statement in a SAS Program? **

**Ans:**

The STOP statement causes SAS to stop processing the current DATA step immediately and resume processing with the statements after the end of the current DATA step.

**38.What is the difference between using drop = data set option in data statement and set statement? **

**Ans:**

- If you don’t want to process certain variables and do not want them to appear in the new data set, specify the drop= data set option in the SET statement.
- If you want to process certain variables but do not want them to appear in the new data set, specify the drop= data set option in the DATA statement.

**39.Given an unsorted data set, how to read the last observation to a new data set? **

**Ans:**

We can read the last observation to a new data set using the end= option of the SET statement. For example:

    data work.calculus;
      set work.comp end=last;
      if last;
    run;

Here calculus is the new data set to be created and comp is the existing data set. last is a temporary variable (initialized to 0) that is set to 1 when the SET statement reads the last observation.

**40.What is the difference between reading data from an external file and reading data from an existing data set? **

**Ans:**

The main difference is that when reading an existing data set with the SET statement, SAS retains the values of the variables from one observation to the next. When reading data from an external file with the INPUT statement, the variables are reset to missing at the start of each DATA step iteration, so a value must be retained explicitly (for example with a RETAIN statement) if it is needed in later iterations.

**41.How many data types are there in SAS? **

**Ans:**

There are two data types in SAS: character and numeric. Dates are stored as numeric values (the number of days since 1 January 1960), and SAS provides date functions and formats to work with them.

**42.What is the difference between SAS functions and procedures? **

**Ans:**

Functions expect argument values to be supplied across an observation in a SAS data set, whereas a procedure expects one variable value per observation.

For example:

    data average;
      set temp;
      avgtemp = mean(of T1-T24);
    run;

**43.What are the differences between sum function and using “+” operator? **

**Ans:**

SUM function returns the sum of non-missing arguments whereas “+” operator returns a missing value if any of the arguments are missing.
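The difference in missing-value handling can be mimicked outside SAS. The Python sketch below uses `None` to stand in for a SAS missing value; the helper names `sas_sum` and `sas_plus` are hypothetical and exist only to illustrate the two behaviours.

```python
def sas_sum(*args):
    """Like the SUM function: add the non-missing arguments (missing only if all are missing)."""
    present = [a for a in args if a is not None]
    return sum(present) if present else None

def sas_plus(a, b):
    """Like the + operator: any missing operand makes the result missing."""
    return None if a is None or b is None else a + b

print(sas_sum(10, None, 5))   # missing argument is skipped
print(sas_plus(10, None))     # missing operand propagates
```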

**44.What are the differences between PROC MEANS and PROC SUMMARY? **

**Ans:**

- PROC MEANS prints its results by default. Subgroup statistics are produced with a CLASS statement, or with a BY statement when the input data has been previously sorted (using PROC SORT) by the BY variables.
- PROC SUMMARY computes the same statistics, including statistics for all subgroups defined by CLASS variables in one run, but does not print any output by default. You need to use the OUTPUT statement to create a new data set and use PROC PRINT to see the computed statistics.

**45.Give an example where SAS fails to convert character value to numeric value automatically? **

**Ans:**

Suppose value of a variable PayRate begins with a dollar sign ($). When SAS tries to automatically convert the values of PayRate to numeric values, the dollar sign blocks the process. The values cannot be converted to numeric values. Therefore, it is always best to include INPUT and PUT functions in your programs when conversions occur.

**46.How do you delete duplicate observations in SAS? **

**Ans:**

There are three ways to delete duplicate observations in a data set.

By using nodups in the procedure:

    proc sort data=SAS-Dataset nodups;
      by var;
    run;

By using an SQL query inside a procedure:

    proc sql;
      create table New-SAS-Dataset as
        select distinct * from Old-SAS-Dataset;
    quit;

By cleaning the data in a DATA step (temp must already be sorted by group):

    data New-SAS-Dataset;
      set temp;
      by group;
      /* keeps only BY values that occur once; use "if first.group" to keep one row per group */
      if first.group and last.group then output;
    run;

**47.How does PROC SQL work? **

**Ans:**

PROC SQL processes all the observations simultaneously. The following steps happen when PROC SQL is executed:

- SAS scans each statement in the SQL procedure and checks for syntax errors, such as missing semicolons and invalid statements.
- The SQL optimizer scans the query inside the statement and decides how the query should be executed in order to minimize run time.
- Any tables in the FROM clause are loaded into the data engine, where they can then be accessed in memory.
- Code and calculations are executed.
- The final table is created in memory.
- The final table is sent to the output table described in the SQL statement.

**48.Briefly explain Input and Put function?**

**Ans:**

- INPUT function – character to numeric conversion: input(source, informat)
- PUT function – numeric to character conversion: put(source, format)

**49.What would be the result of the following SAS function (given that 31 Dec, 2000 is Sunday)?**

**Ans:**

    Weeks = intck('week', '31dec2000'd, '01jan2001'd);
    Years = intck('year', '31dec2000'd, '01jan2001'd);
    Months = intck('month', '31dec2000'd, '01jan2001'd);

- Weeks = 0. 31st December 2000 was a Sunday, so 1st January 2001 is a Monday in the same week, and no week boundary is crossed.
- Years = 1, since the two days are in different calendar years.
- Months = 1, since the two days are in different calendar months.
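The boundary-counting logic behind these results can be reproduced in a short sketch. This Python version is an approximation written for illustration: the function name `intck` is borrowed from SAS, and it assumes weeks start on Sunday and that only week, month and year intervals are needed.

```python
from datetime import date

def intck(interval, start, end):
    """Count the interval boundaries crossed between two dates."""
    if interval == "week":
        # Proleptic ordinals divisible by 7 fall on Sundays,
        # so integer division by 7 numbers the Sunday-based weeks.
        return end.toordinal() // 7 - start.toordinal() // 7
    if interval == "month":
        return (end.year - start.year) * 12 + (end.month - start.month)
    if interval == "year":
        return end.year - start.year
    raise ValueError(f"unsupported interval: {interval}")

start, end = date(2000, 12, 31), date(2001, 1, 1)
for unit in ("week", "month", "year"):
    print(unit, intck(unit, start, end))
```

Because the function counts boundaries rather than elapsed time, two dates one day apart can be one month and one year apart, yet zero weeks apart.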

**50. What are the applications of predictive Modelling?**

**Ans:**

- Customer targeting.
- Churn prevention.
- Sales forecasting.
- Market analysis.
- Risk assessment.
- Financial modeling.

**51.What is the length assigned to the target variable by the scan function? **

**Ans:**

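In a DATA step, the SCAN function has traditionally assigned a length of 200 to a target variable whose length has not been declared. More recent SAS releases instead give the variable the length of the first argument, so check the documentation for your version.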

**52.Name few SAS functions? **

**Ans:**

Scan, Substr, trim, Catx, Index, tranwrd, find, Sum.

**53.What is the work of tranwrd function? **

**Ans:**

TRANWRD function replaces or removes all occurrences of a pattern of characters within a character string.

**54.What are the four primary aspects of predictive analytics?**

**Ans:**

- Data Sourcing.
- Data Utility.
- Deep Learning, Machine Learning, and Automation.
- Objectives and Usage.

**55.How do you use the do loop if you don’t know how many times you should execute the do loop? **

**Ans:**

We can use ‘do until’ or ‘do while’ to specify the condition.

**56.How do dates work in SAS data? **

**Ans:**

- Data is central to every data set. In SAS, data is available in tabular form, where variables occupy the columns and observations occupy the rows.
- SAS treats numbers as numeric data and everything else as character data. Hence SAS has two data types: numeric and character.
- Dates are represented in a special way compared to other languages: a SAS date is stored as a numeric value, the number of days since 1 January 1960, and is displayed using date formats.

**57.What exactly the term Ensembling stands for in predictive modeling?**

**Ans:**

In general, ensembling is a technique of combining two or more algorithms of similar or dissimilar types called base learners. This is done to make a more robust system which incorporates the predictions from all the base learners.

**58.What is the difference between do while and do until? **

**Ans:**

An important difference between the DO UNTIL and DO WHILE statements is that the DO WHILE expression is evaluated at the top of the DO loop. If the expression is false the first time it is evaluated, the DO loop never executes. The DO UNTIL expression is evaluated at the bottom of the loop, so DO UNTIL always executes at least once.
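The top-versus-bottom evaluation can be seen by counting loop passes. The following Python sketch emulates the two SAS loop forms; the function names and the counter-based loop body are assumptions made purely for illustration.

```python
def run_do_while(n, limit):
    """Emulates DO WHILE (n < limit): the condition is checked before each pass."""
    passes = 0
    while n < limit:
        n += 1
        passes += 1
    return passes

def run_do_until(n, limit):
    """Emulates DO UNTIL (n >= limit): the condition is checked after each pass."""
    passes = 0
    while True:
        n += 1
        passes += 1
        if n >= limit:
            break
    return passes

# Starting with n already at the limit shows the difference.
print(run_do_while(5, 5), run_do_until(5, 5))
```

When the starting value already satisfies the stopping condition, the DO WHILE form performs no passes while the DO UNTIL form still performs one.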

**59.How do you specify the number of iterations and specific condition within a single do loop? **

**Ans:**

    data work;
      do i=1 to 20 until(Sum>=20000);
        Year+1;
        Sum+2000;
        Sum+Sum*.10;
      end;
    run;

This iterative DO statement executes the DO loop until Sum is greater than or equal to 20000 or until the DO loop has executed 20 times, whichever occurs first.
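To see which stopping rule fires first, the loop can be traced by hand or simulated. The Python sketch below mirrors the DATA step under the assumption that the upper bound is 20, as written in the DO statement; the function and parameter names are hypothetical.

```python
def simulate(max_iterations=20, target=20000, deposit=2000, rate=0.10):
    """Mimic: do i=1 to 20 until(Sum>=20000); Year+1; Sum+2000; Sum+Sum*.10; end;"""
    total, years = 0.0, 0
    for _ in range(max_iterations):   # iteration bound (do i=1 to 20)
        years += 1
        total += deposit
        total += total * rate
        if total >= target:           # UNTIL condition, checked at the bottom of the loop
            break
    return years, total

years, total = simulate()
print(years, round(total, 4))
```

In this simulation the UNTIL condition stops the loop in year 7, well before the iteration limit is reached.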

**60.What are the parameters of Scan function? **

**Ans:**

The SCAN function is used as follows: scan(argument, n, delimiters). Here, argument specifies the character variable or expression to scan, n specifies which word to read, and delimiters lists the characters that separate words; it must be enclosed in quotation marks.

**61.If a variable contains only numbers, can it be a character data type? **

**Ans:**

Yes, it depends on how you use the variable. There are some numbers we will want to use as a categorical value rather than a quantity. An example of this can be a variable called “Foreigner” where the observations have the value “0” or “1” representing not a foreigner and foreigner respectively. Similarly, the ID of a particular table can be in number but does not specifically represent any quantity. Phone numbers is another popular example.

**62.If a variable contains letters or special characters, can it be numeric data type? **

**Ans:**

No, it must be character data type.

**63.What can be the size of largest dataset in SAS? **

**Ans:**

- The number of observations is limited only by computer’s capacity to handle and store them.
- Prior to SAS 9.1, SAS data sets could contain up to 32,767 variables. In SAS 9.1, the maximum number of variables in a SAS data set is limited by the resources available on your computer.

**64.Give some examples where PROC REPORT’s defaults are different than PROC PRINT’s defaults? **

**Ans:**

- No Record Numbers in Proc Report.
- Labels (not var names) used as headers in Proc Report.
- REPORT needs NOWINDOWS option.

**65.Give some examples where PROC REPORT’s defaults are same as PROC PRINT’s defaults? **

**Ans:**

- Variables/Columns in position order.
- Rows ordered as they appear in data set.

**66.What is the purpose of trailing @ and @@? How do you use them? **

**Ans:**

The trailing @ is also known as a line-hold specifier. Using the trailing @ in the INPUT statement lets you read part of a raw data line, test it, and then decide how to read additional data from the same record.

- The single trailing @ tells SAS to “hold the line”.
- The double trailing @@ tells SAS to “hold the line more strongly”. An INPUT statement ending with @@ instructs the program to release the current raw data line only when there are no data values left to be read from that line. The @@, therefore, holds the input record even across multiple iterations of the DATA step.

**67.What is the difference between Order and Group variable in proc report? **

**Ans:**

- If the variable is used as a group variable, rows that have the same values are collapsed.
- Order variables produce a list report, whereas group variables produce a summary report.

**68.Give some ways by which you can define the variables to produce the summary report (using proc report)? **

**Ans:**

All of the variables in a summary report must be defined as group, analysis, across or computed variables.

**69.What are the default statistics for means procedure? **

**Ans:**

n-count, mean, standard deviation, minimum, and maximum.

**70.How to limit decimal places for variable using PROC MEANS? **

**Ans:**

By using MAXDEC= option

**71.What is the difference between CLASS statement and BY statement in proc means? **

**Ans:**

- Unlike CLASS processing, BY processing requires that your data already be sorted or indexed in the order of the BY variables.
- BY group results have a layout that is different from the layout of CLASS group results.

**72.What is the difference between PROC MEANS and PROC Summary? **

**Ans:**

The difference between the two procedures is that PROC MEANS produces a report by default. By contrast, to produce a report in PROC SUMMARY, you must include a PRINT option in the PROC SUMMARY statement.

**73.How to specify variables to be processed by the FREQ procedure? **

**Ans:**

By using TABLES Statement.

**74.Describe CROSSLIST option in TABLES statement? **

**Ans:**

Adding the CROSSLIST option to TABLES statement displays crosstabulation tables in ODS column format.

**75.How to create list output for crosstabulations in proc freq? **

**Ans:**

To generate list output for crosstabulations, add a slash (/) and the LIST option to the TABLES statement in your PROC FREQ step:

    TABLES variable-1*variable-2 <* … variable-n> / LIST;

**76.Where do you use PROC MEANS over PROC FREQ? **

**Ans:**

We will use PROC MEANS for numeric variables whereas we use PROC FREQ for categorical variables.

**77.Explain how merging helps to combine data sets. **

**Ans:**

- Merging combines observations from two or more SAS data sets into a single observation in a new data set.
- A one-to-one merge combines observations based on their position in the data sets. You use the MERGE statement for one-to-one merging.

**78.What do you understand by bagging?**

**Ans:**

Bagging (short for bootstrap aggregating) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.
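The core of the technique, resampling with replacement and averaging the results, fits in a few lines. The Python sketch below is deliberately simplified: `bagged_estimate` is a hypothetical helper, and the base learner is reduced to a function that returns a single number (here the median), whereas a real ensemble would fit a model such as a decision tree on each resample.

```python
import random
import statistics

def bagged_estimate(train, base_learner, n_models=25, seed=0):
    """Average the outputs of base_learner applied to bootstrap resamples of train."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_models):
        # Bootstrap: draw len(train) items with replacement.
        resample = [rng.choice(train) for _ in train]
        outputs.append(base_learner(resample))
    return statistics.mean(outputs)            # aggregate by averaging

data = [3, 5, 7, 9, 250]
print(bagged_estimate(data, statistics.median))
```

For classification the aggregation step would typically be a majority vote instead of an average.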

**79.What is interleaving in SAS? **

**Ans:**

Interleaving combines individual, sorted SAS data sets into one sorted SAS data set. You interleave data sets using a SET statement along with a BY statement.

**80. I have a dataset concat having variable a b & c. How to rename a b to e & f? **

**Ans:**

We will use the following code to rename a and b to e and f:

    data concat(rename=(a=e b=f));
      set concat;
    run;

**81.You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this? **

**Ans:**

This is equivalent to making the model more robust to outliers. See Q84.

**82. (Given a Dataset) Analyze this dataset and give me a model that can predict this response variable. **

**Ans:**

- Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try some complicated models. Always split the dataset into train, validation, test dataset and use cross validation to check their performance.
- Determine if the problem is classification or regression.
- Favor simple models that run quickly and you can easily explain.
- Mention cross validation as a means to evaluate the model.
- Plot and visualize the data.

**83.What could be some issues if the distribution of the test data is significantly different than the distribution of the training data? **

**Ans:**

- The model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population data and thus the generalizability of the algorithm is hard to measure. This should be mitigated by repeated splitting of train vs test dataset (as in cross validation).
- When there is a change in data distribution, this is called the dataset shift. If the train and test data has a different distribution, then the classifier would likely overfit to the train data.

**84.What are some ways I can make my model more robust to outliers? **

**Ans:**

Changes to the algorithm:

- Use tree-based methods instead of regression methods, as they are more resistant to outliers. For statistical tests, use non-parametric tests instead of parametric ones.
- Use robust error metrics such as MAE or Huber loss instead of MSE.
- Use regularization such as L1 or L2 to reduce variance (at the cost of increased bias).

Changes to the data:

- Winsorize the data.
- Transform the data (e.g. log).
- Remove outliers only if you’re certain they’re anomalies not worth predicting.

**85.What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error metric be appropriate? **

**Ans:**

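- MSE penalizes large errors more heavily, so it is more sensitive to outliers. MAE is more robust in that sense, but the model is harder to fit because the absolute value is not differentiable at zero and the minimizer has no closed-form solution. Minimizing squared error targets the mean of the response, while minimizing absolute error targets the median. So when robustness to outliers matters and the model is still computationally easy to fit, we should use MAE, and if that’s not the case, we should use MSE.
- MSE: the gradient is easy to compute. MAE: fitting requires linear programming or an iterative method.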
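A small numeric check can make the difference tangible. The sketch below searches over constant predictions for a made-up sample with one outlier: the constant that minimizes squared error sits near the mean, and the constant that minimizes absolute error is the median. The data and the integer grid are assumptions for illustration.

```python
import statistics

def mse(values, c):
    """Mean squared error of predicting the constant c for every value."""
    return statistics.fmean((v - c) ** 2 for v in values)

def mae(values, c):
    """Mean absolute error of predicting the constant c for every value."""
    return statistics.fmean(abs(v - c) for v in values)

values = [10, 11, 12, 13, 100]   # made-up sample; 100 is an outlier
candidates = range(0, 101)       # integer grid of constant predictions

best_for_mse = min(candidates, key=lambda c: mse(values, c))
best_for_mae = min(candidates, key=lambda c: mae(values, c))

print(statistics.fmean(values), statistics.median(values))   # 29.2 12
print(best_for_mse, best_for_mae)                            # 29 12
```

The single outlier drags the squared-error solution far away from the bulk of the data, while the absolute-error solution stays put.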

**86.What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups? **

**Ans:**

- Accuracy: the proportion of instances you predict correctly. Pros: intuitive, easy to explain. Cons: works poorly when the class labels are imbalanced and the signal from the data is weak.
- AUROC: plot the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis for different thresholds, and take the area under that curve. Given a random positive instance and a random negative instance, the AUC is the probability that the model ranks the positive one higher. Pros: works well for testing the ability to distinguish the two classes. Cons: the predictions can’t be interpreted as probabilities (because AUC is determined by rankings only), so it can’t explain the uncertainty of the model.
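The sketch below shows why accuracy can mislead on imbalanced classes. It computes accuracy and a pairwise version of AUROC on a made-up dataset with 5% positives. The function names and the data are assumptions for illustration, and the pairwise AUROC is quadratic in the sample size, so it is only suitable for small examples.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1] * 5 + [0] * 95              # 5% positives

# A useless classifier that always predicts the majority class.
always_negative = [0] * 100
constant_scores = [0.1] * 100
print(accuracy(labels, always_negative))   # 0.95
print(auroc(labels, constant_scores))      # 0.5

# A scorer that actually separates the classes.
useful_scores = [0.9] * 5 + [0.2] * 95
print(auroc(labels, useful_scores))        # 1.0
```

The always-negative classifier looks strong on accuracy but is no better than chance at ranking positives above negatives.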

**87.What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? **

**Ans:**

Logistic regression:

- Works when the features are roughly linear and the problem is roughly linearly separable.
- Robust to noise; use L1 or L2 regularization for model selection and to avoid overfitting.
- The output comes as probabilities.
- Efficient, and the computation can be distributed.
- Can be used as a baseline for other algorithms.
- (-) Can hardly handle categorical features.

SVM with a nonlinear kernel:

- Can deal with problems that are not linearly separable.
- (-) Slow to train; for most industry-scale applications, not really efficient.

**88.What is regularization and where might it be helpful? What is an example of using regularization in a model? **

**Ans:**

Regularization is useful for reducing variance in the model, which helps avoid overfitting. For example, Lasso regression uses L1 regularization to penalize large coefficients, which can shrink some of them to exactly zero.
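For a single feature with no intercept, the penalized least-squares solutions have simple closed forms, which makes the effect of the penalty easy to see. The sketch below is an illustration under that one-feature assumption, not a general Lasso solver. The function names, the penalty value, and the data are made up for the example.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return zero if it is smaller than penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_one_feature(xs, ys, penalty=0.0, kind="none"):
    """Coefficient w for y ~ w * x (no intercept) under no, L2, or L1 penalty."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if kind == "l2":   # minimizes 0.5 * sum((y - w*x)^2) + 0.5 * penalty * w^2
        return sxy / (sxx + penalty)
    if kind == "l1":   # minimizes 0.5 * sum((y - w*x)^2) + penalty * |w|
        return soft_threshold(sxy, penalty) / sxx
    return sxy / sxx

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
strong = [-4.0, -2.0, 0.0, 2.0, 4.0]   # y = 2x exactly
weak = [0.2, -0.1, 0.0, 0.1, -0.1]     # almost unrelated to x

print(fit_one_feature(xs, strong))                      # 2.0
print(fit_one_feature(xs, strong, 1.0, kind="l1"))      # 1.9
print(fit_one_feature(xs, weak, 1.0, kind="l1"))        # 0.0
print(fit_one_feature(xs, weak, 1.0, kind="l2"))        # small but not zero
```

The L1 penalty removes the weak feature entirely, while the L2 penalty only shrinks its coefficient, which is why Lasso is often used for variable selection.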

**89.Why might it be preferable to include fewer predictors over many? **

**Ans:**

- Adding irrelevant features increases the model’s tendency to overfit, because those features introduce more noise. When two variables are correlated, their effects can also be harder to interpret, for example in regression.
- Curse of dimensionality.
- Adding noise variables makes the model more complicated without making it more useful.
- Computational cost.

**90.Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days after only observing 2 days worth of data? **

**Ans:**

- Build a time series model on the training data with a seven-day cycle, then apply it to a new tweet for which only the first 2 days of data are available.
- Build a regression function to estimate the number of retweets as a function of time t.
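One simple way to read the regression idea is to regress the day-7 retweet count on the day-2 count across the training tweets, then apply the fitted line to a new tweet. The sketch below does that with ordinary least squares. This is a simplification of "a function of time t", and the function names and all counts are made up for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Made-up training tweets: retweets after 2 days and after 7 days.
retweets_day2 = [10, 20, 40, 80]
retweets_day7 = [25, 45, 85, 165]

intercept, slope = fit_line(retweets_day2, retweets_day7)

def predict_day7(day2_count):
    """Predicted 7-day retweet count for a tweet observed for 2 days."""
    return intercept + slope * day2_count

print(round(predict_day7(50), 1))   # 105.0
```

Real retweet counts tend to be heavily skewed, so fitting on log counts or adding features such as follower count would be natural next steps.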

**91.How could you collect and analyze data to use social media to predict the weather? **

**Ans:**

- We can collect social media data using the Twitter, Facebook, and Instagram APIs. Then, taking Twitter as an example, we can construct features from each tweet, e.g. the date it was tweeted, the number of favorites and retweets, and of course features created from the tweeted content itself.
- Then use a multivariate time series model to predict the weather.

**92.How would you construct a feed to show relevant content for a site that involves user interactions with items? **

**Ans:**

We can do so by building a recommendation engine. The easiest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content is news articles. To be more accurate, we can build a content-based or collaborative filtering system. If there’s enough user usage data, we can try collaborative filtering and recommend content that similar users have consumed. If there isn’t, we can recommend similar items based on a vectorization of the items (content-based filtering).
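As a minimal sketch of the collaborative filtering idea, the example below uses an item-based variant: each item is represented by the set of users who interacted with it, and unseen items are scored by their cosine similarity to the items a user already consumed. The function names, user names, and interaction data are assumptions for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(interactions, user, top_n=1):
    """Rank items the user has not seen by similarity to the items they have seen."""
    items = sorted({item for seen in interactions.values() for item in seen})
    users = sorted(interactions)
    # One 0/1 vector per item: which users interacted with it.
    vectors = {item: [1 if item in interactions[u] else 0 for u in users] for item in items}
    seen = interactions[user]
    scores = {
        item: sum(cosine(vectors[item], vectors[s]) for s in seen)
        for item in items if item not in seen
    }
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Made-up interaction log: user -> set of items they engaged with.
interactions = {
    "ann": {"article_a", "article_b", "article_c"},
    "bob": {"article_a", "article_b"},
    "cat": {"article_b", "article_c"},
    "dan": {"article_d"},
}

print(recommend(interactions, "bob"))   # ['article_c']
```

A user with no overlap with anyone else gets scores of zero for every item, which is the cold-start case where popularity or content-based features have to take over.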

**93.How would you design the people you may know feature on LinkedIn or Facebook? **

**Ans:**

- Find pairs of people who are not yet connected but are strongly linked in a weighted connection graph.
- Define similarity as how strongly the two people are connected.
- Given a certain feature, we can calculate the similarity based on friend connections (neighbors).
- Check-ins: people who are at the same location all the time.
- Same college or workplace.
- To test the performance of the algorithm, randomly drop edges from the graph and check whether the algorithm suggests them again.
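The neighbor-based similarity can be sketched with a simple count of mutual connections. The example below ranks friends of friends who are not yet connected to a given person. The function name, the people, and the graph are made up for illustration, and a real system would likely add edge weights and the other signals listed above.

```python
from collections import Counter

def suggest_connections(graph, person, top_n=3):
    """Rank people not yet connected to `person` by number of mutual connections."""
    counts = Counter()
    for friend in graph[person]:
        for friend_of_friend in graph[friend]:
            if friend_of_friend != person and friend_of_friend not in graph[person]:
                counts[friend_of_friend] += 1
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]

# Made-up undirected graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

print(suggest_connections(graph, "ann"))   # [('dan', 2), ('eve', 1)]
```

To evaluate such a scorer in the way described above, one could hide a random subset of existing edges and measure how often the hidden connections appear among the top suggestions.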

**94.How would you predict who someone may want to send a Snapchat or Gmail to? **

**Ans:**

- Suggest the people to whom the user has sent the most emails in the past, weighted by time decay so that recent messages count more.
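Here is a minimal sketch of frequency with time decay, where each past message contributes a weight that halves every fixed number of days. The function name, the 30-day half-life, and the message history are assumptions chosen for illustration.

```python
from collections import defaultdict

def rank_recipients(history, now, half_life_days=30.0):
    """history: list of (recipient, day_sent). Each message weighs 0.5 ** (age / half_life)."""
    scores = defaultdict(float)
    for recipient, day_sent in history:
        age = now - day_sent
        scores[recipient] += 0.5 ** (age / half_life_days)
    return sorted(scores, key=lambda r: (-scores[r], r))

# Made-up history: three old messages to one person, two recent ones to another.
history = [("old_friend", 0)] * 3 + [("new_colleague", 95), ("new_colleague", 98)]

print(rank_recipients(history, now=100))
# ['new_colleague', 'old_friend']

# With a very long half-life the decay almost disappears and raw frequency wins.
print(rank_recipients(history, now=100, half_life_days=1e9))
# ['old_friend', 'new_colleague']
```

The half-life is a tuning parameter: a short one favors whoever the user contacted most recently, a long one favors long-standing contacts.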

**95.How would you suggest to a franchise where to open a new store? **

**Ans:**

- Build a master dataset with the local demographic information available for each location.
- Useful features include local income levels, proximity to traffic, weather, population density, and proximity to other businesses.
- Add a reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate).

**96.In a search engine, given partial data on what the user has typed, how would you predict the user’s eventual search query? **

**Ans:**

Based on how often words have followed a given sequence of words in past queries, we can construct conditional probabilities for the next words that can show up (an n-gram model). The sequences with the highest conditional probabilities are shown as the top candidates.
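The simplest n-gram model is a bigram model, which conditions only on the last word typed. The sketch below counts word pairs in past queries and turns the counts into conditional probabilities. The function names and the query log are made up for illustration, and a production system would condition on longer contexts and on partially typed words as well.

```python
from collections import Counter, defaultdict

def train_bigrams(queries):
    """Count how often each word follows each other word in past queries."""
    following = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def suggest_next(following, partial_query, top_n=2):
    """Most likely next words after the last typed word, with conditional probabilities."""
    words = partial_query.lower().split()
    if not words or words[-1] not in following:
        return []
    counts = following[words[-1]]
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(word, count / total) for word, count in ranked[:top_n]]

# Made-up query log.
past_queries = [
    "weather in london",
    "weather in paris",
    "weather in london today",
    "weather forecast",
]
model = train_bigrams(past_queries)

for word, probability in suggest_next(model, "weather in"):
    print(word, round(probability, 2))
# london 0.67
# paris 0.33
```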

**97.How would you build a model to predict a March Madness bracket? **

**Ans:**

Represent each team, A and B, as a feature vector. Take the difference of the two vectors and use it as the input to a model that predicts the probability that team A wins. Train the model on past tournament data, then make a prediction for the new tournament by running the trained model for each round of the bracket.
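One possible model for the difference-of-vectors idea is logistic regression without an intercept, trained by plain gradient descent. The sketch below assumes that choice; the two features, the team ratings, and the game results are all made up for illustration.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def win_probability(weights, team_a, team_b):
    """P(team A beats team B) from the difference of the two feature vectors."""
    diff = [a - b for a, b in zip(team_a, team_b)]
    return sigmoid(sum(w * d for w, d in zip(weights, diff)))

def train(games, n_features, epochs=500, learning_rate=0.1):
    """games: list of (team_a_features, team_b_features, 1 if A won else 0)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for team_a, team_b, a_won in games:
            error = win_probability(weights, team_a, team_b) - a_won
            for j in range(n_features):
                # Gradient of the log loss with respect to weight j.
                weights[j] -= learning_rate * error * (team_a[j] - team_b[j])
    return weights

# Made-up features per team: [offense rating, defense rating].
strong, mid, weak = [0.9, 0.8], [0.6, 0.5], [0.3, 0.2]
past_games = [(strong, mid, 1), (mid, weak, 1), (weak, strong, 0), (mid, strong, 0)]

weights = train(past_games, n_features=2)
p_strong_beats_weak = win_probability(weights, strong, weak)
p_weak_beats_strong = win_probability(weights, weak, strong)
print(p_strong_beats_weak > 0.5)   # True
```

Leaving out the intercept makes the model symmetric: swapping the two teams flips the sign of the difference vector, so the two win probabilities always sum to one. To fill in a bracket, the winner predicted for each game would be advanced to the next round and the model applied again.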