
SAS Certified BI Content Developer for SAS 9 and Business Analytics Questions and Answers (Dumps and Practice Questions)



Question : An analyst fits a logistic regression model to predict whether
or not a client will default on a loan. One of the predictors in the model
is agent, and each agent serves 15-20 clients. The model fails to converge.
The analyst prints the summarized data, showing the number of defaulted
loans per agent. See the partial output below:
What is the most likely reason that the model fails to converge?

1. There is quasi-complete separation in the data.
2. There is collinearity among the predictors.
3. There are missing values in the data.
4. There are too many observations in the data.

Correct Answer : 1
Explanation: When we perform logistic regression, we may run into an issue known as 'complete or quasi-complete separation of data points'. In this situation the maximum likelihood estimate does not exist. If we use SAS PROC LOGISTIC, the SAS log will give a warning message: "WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist. WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable." SAS will still report the Wald test results and odds ratios; however, these tests are no longer valid and the results are not reliable (in fact, not accurate at all).
Complete separation looks like the data below:
Y  X
0  1
0  2
0  4
1  5
1  6
1  9
There is complete separation because all of the cases in which Y is 0 have X values less than or equal to 4, and all of the cases in which Y is 1 have X values greater than or equal to 5. In other words, the maximal value in one group is less than the minimal value in the other group. When the maximal value in one group equals the minimal value in the other group, quasi-complete separation may occur. If the explanatory variable is categorical, complete separation of data points could look like this:

            Response
Predictor   Failure   Success
0                25         0
1                 0        21

There are no successes when the value of the predictor variable is 0, and there are no failures when the value of the predictor variable is 1.
For maximum likelihood estimates to exist, there must be some overlap between the two distributions. Since logistic regression uses maximum likelihood estimation, when there is no overlap of data points between the two groups, the results from the logistic regression model are unreliable and should not be trusted.
Starting with SAS 9.2, PROC LOGISTIC provides Firth estimation for dealing with the issue of quasi-complete or complete separation of data points:
proc logistic;
   model y = x / firth;
run;
However, even with Firth estimation, the results should still be interpreted with extreme caution. Complete or quasi-complete separation of the data points may occur when the sample size is small, or when the samples are selected on the outcome (i.e., the response) rather than on the explanatory variables; we see many publications where the analysis is based on responders vs. non-responders.
When complete or quasi-complete separation occurs in multivariate regression, the explanatory variable causing it should be identified and preferably excluded from the model. For univariate regression, an alternative statistical test (for example, a group t-test) should be used.
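
As a minimal, hedged sketch tying the pieces above together (the dataset name SEP is illustrative; the variables Y and X come from the toy listing earlier), the code below recreates the completely separated data and fits it with and without the FIRTH option. The first fit should produce the separation warning quoted above; the second produces penalized-likelihood estimates that should still be interpreted with caution.

* Toy data with complete separation: every Y=0 has X <= 4, every Y=1 has X >= 5;
data sep;
   input y x;
   datalines;
0 1
0 2
0 4
1 5
1 6
1 9
;
run;

* Ordinary maximum likelihood fit: expect the separation warning in the log;
proc logistic data=sep;
   model y(event='1') = x;
run;

* Firth penalized likelihood fit;
proc logistic data=sep;
   model y(event='1') = x / firth;
run;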






Question : An analyst knows that the categorical predictor, store_id, is an important
predictor of the target. However, store_id has too many levels to be a feasible predictor in the model.
The analyst wants to combine stores and treat them as members of the same class level.
What are the two most effective ways to address the problem?

A. Eliminate store_id as a predictor in the model because it has too many levels to be feasible.
B. Cluster by using Greenacre's method to combine stores that are similar.
C. Use subject matter expertise to combine stores that are similar.
D. Randomly combine the stores into five groups to keep the stochastic variation among the observations intact.

1. A,B
2. B,C
3. C,D
4. A,D

Correct Answer : 2
Explanation: The GREENACRE | GRE option displays adjusted inertias when performing multiple correspondence analysis. By default, unadjusted inertias, the usual inertias from multiple correspondence analysis, are displayed. However, adjusted inertias using a method proposed by Greenacre (1994, p. 156) can be displayed by specifying the GREENACRE option. Specify the UNADJUSTED option to output the usual table of unadjusted inertias as well.
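
As a hedged sketch of the level-collapsing idea the question describes (this is not the GREENACRE option quoted above, and Greenacre's criterion for choosing the number of clusters is omitted), one common approach is to cluster the store levels on their observed event rates. Everything named below is an assumption for illustration: the summary dataset STORE_STATS with one row per store, the variables PROP_EVENT and N_OBS, and the choice of five clusters.

* One row per store_id with its event proportion (prop_event) and size (n_obs);
proc cluster data=store_stats method=ward outtree=tree;
   freq n_obs;        * weight each store by its number of observations;
   var prop_event;    * cluster stores on their target event rate;
   id store_id;
run;

* Cut the dendrogram to assign each store to one of five combined levels;
proc tree data=tree nclusters=5 out=store_clusters noprint;
   id store_id;
run;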



Question :

Including redundant input variables in a regression model can:
1. Stabilize parameter estimates and increase the risk of overfitting.
2. Destabilize parameter estimates and increase the risk of overfitting.
3. Stabilize parameter estimates and decrease the risk of overfitting.
4. Destabilize parameter estimates and decrease the risk of overfitting.

Correct Answer : 2

Explanation: Multicollinearity means that one variable, or a combination of explanatory variables, is redundant. Multicollinearity leads to an over-counting type of bias and an unstable, unreliable model. In regression, "multicollinearity" refers to predictors that are correlated with other predictors. Multicollinearity occurs when your model includes multiple factors that are correlated not just with your response variable, but also with each other. In other words, it results when you have factors that are somewhat redundant.

You can think about it in terms of a football game: if one player tackles the opposing quarterback, it's easy to give credit for the sack where credit's due. But if three players are tackling the quarterback simultaneously, it's much more difficult to determine which of the three makes the biggest contribution to the sack.

Not that into football? Try this analogy instead: you go to see a rock and roll band with two great guitar players. You're eager to see which one plays best, but on stage they're both playing furious leads at the same time. When they're both playing loud and fast, how can you tell which guitarist has the biggest effect on the sound? Even though they aren't playing the same notes, what they're doing is so similar that it's difficult to tell one from the other. That's the problem with multicollinearity.
Multicollinearity increases the standard errors of the coefficients. Increased standard errors in turn mean that coefficients for some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity makes some variables statistically insignificant when they should be significant. Without multicollinearity (and thus with lower standard errors), those coefficients might be significant.
A little bit of multicollinearity isn't necessarily a huge problem: extending the rock band analogy, if one guitar player is louder than the other, you can easily tell them apart. But severe multicollinearity is a major problem, because it increases the variance of the regression coefficients, making them unstable. The more variance they have, the more difficult it is to interpret the coefficients.
So, how do you know if you need to be concerned about multicollinearity in your regression model? Here are some things to watch for:
" A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with Y.
" When you add or delete an X variable, the regression coefficients change dramatically.
" You see a negative regression coefficient when your response should increase along with X.
" You see a positive regression coefficient when the response should decrease as X increases.
" Your X variables have high pairwise correlations.
How Can I Deal With Multicollinearity?
If multicollinearity is a problem in your model -- if the VIF for a factor is near or above 5 -- the solution may be relatively simple. Try one of these:
" Remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't drastically reduce the R-squared. Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these variables. Select the model that has the highest R-squared value.

" Use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number of predictors to a smaller set of uncorrelated components. Multicollinearity is problem that you can run into when you're fitting a regression model, or other linear model. It refers to predictors that are correlated with other predictors in the model. Unfortunately, the effects of multicollinearity can feel murky and intangible, which makes it unclear whether it's important to fix.
Moderate multicollinearity may not be problematic. However, severe multicollinearity is a problem because it can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.
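
As a hedged illustration of the VIF check mentioned above (the dataset name MYDATA and the variables Y, X1, X2, X3 are placeholders, not from the original question), the VIF and TOL options of the MODEL statement in PROC REG print variance inflation factors and tolerances for each predictor:

* Request variance inflation factors; VIF near or above 5 suggests collinearity;
proc reg data=mydata;
   model y = x1 x2 x3 / vif tol;
run;
quit;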






Related Questions


Question : A marketing campaign will send brochures describing an expensive product to a set of customers.
The cost for mailing and production per customer is $50. The company makes $500 revenue for each sale.
What is the profit matrix for a typical person in the population?
1. A
2. B
3. C
4. D
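
The answer choices above refer to matrices that are not reproduced here, so the following is only a hedged worked note on the arithmetic the question sets up: with a $50 mailing cost and $500 revenue per sale, a mailed customer yields $500 - $50 = $450 profit if the customer buys and -$50 if the customer does not buy, while a customer who is not mailed contributes $0 in either case.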



Question : Select the correct statements from the below.
1. The sum of errors will be larger than the mean absolute error if errors are positive
2. The mean absolute error will be larger than the sum if errors are negative
3. The mean absolute error will be smaller than the sum if errors are negative
4. RMSE will equal MAE if all errors are equally large
5. RMSE will be smaller if all errors are not equally large
6. RMSE will be larger if all errors are not equally large
1. 1,3,4,6
2. 1,2,4,6
3. 2,3,4,6
4. 2,3,5,6
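
As a brief worked illustration of the MAE/RMSE relationship this question and the related questions below turn on (the error values are made up purely for illustration): for errors of 2, 2, 2, 2, the sum of errors is 8, MAE = 2, and RMSE = sqrt((4+4+4+4)/4) = 2, so RMSE equals MAE when all errors have the same magnitude. For errors of 1, 1, 1, 5, MAE = 2 but RMSE = sqrt((1+1+1+25)/4) = sqrt(7), roughly 2.65, so RMSE exceeds MAE when the errors are unequal, because squaring weights large errors more heavily.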




Question : You are working in an ecommerce organization, where you are designing and evaluating a recommender system.
Which of the following metrics will always have the largest value?
1. Root Mean Square Error
2. Sum of Errors
3. Mean Absolute Error
4. Not enough information is given.


Question : Both the MAE and RMSE can range from 0 to infinity; higher values are better.
1. True
2. False




Question : A confusion matrix is created for data that were oversampled due to a rare target.
What values are not affected by this oversampling?



1. Sensitivity and PV+
2. Specificity and PV-
3. PV+ and PV-
4. Sensitivity and Specificity



Question : RMSE is most useful when large errors are particularly undesirable.
1. True
2. False