
SAS Certified BI Content Developer for SAS 9 and Business Analytics Questions and Answers (Dumps and Practice Questions)



Question : Refer to the REG procedure output:
An analyst has selected this model as the champion because it shows better model fit than a competing model with more predictors.
Which statistic justifies this rationale?
1. R-Square
2. Coeff Var
3. Adjusted R-Square
4. Error DF

Correct Answer : 3 (Adjusted R-Square)

Explanation :
There's an easy way for you to see an overfit model in action. If you analyze a linear regression model that has one predictor for each degree of freedom, you'll always get an R-squared of 100%!
A key benefit of predicted R-squared is that it can prevent you from overfitting a model. As mentioned earlier, an overfit model contains too many predictors and it starts to model the random noise.
R2 is a statistic that measures the goodness of fit of a model. In regression, the R2 coefficient of determination indicates how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line perfectly fits the data.

Adjusted R2 is a modification of R2 that adjusts for the number of explanatory terms in a model. Unlike R2, the adjusted R2 increases only if the new term improves the model more than would be expected by chance. The adjusted R2 can be negative, and will always be less than or equal to R2.
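
For reference, the standard adjustment, where n is the number of observations and p is the number of predictors (excluding the intercept), is:

Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

Because the penalty grows with p, adding a predictor that contributes little will lower adjusted R2 even though R2 itself rises slightly.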

Adjusted R2 does not have the same interpretation as R2, so care must be taken in interpreting and reporting this statistic. Adjusted R2 is particularly useful in the feature selection stage of model building.

Adjusted R2 is not always better than R2: adjusted R2 will be more useful only if the R2 is calculated based on a sample, not the entire population. For example, if our unit of analysis is a state, and we have data for all counties, then adjusted R2 will not yield any more useful information than R2.

Adjusted R2 compensates for the addition of variables to the model. As more independent variables are added to a regression model, unadjusted R2 will generally increase and never decrease, even when the additional variables do little to help explain the dependent variable. To compensate for this, adjusted R2 is corrected for the number of independent variables in the model. The result is an adjusted R2 that can go up or down depending on whether the addition of another variable adds to the explanatory power of the model. Adjusted R2 will always be lower than unadjusted R2.

It has become standard practice to report the adjusted R2, especially when there are multiple models presented with varying numbers of independent variables.

R2 quantifies how well a model fits the data, so it seems as though it would be an easy way to compare models: simply pick the model with the larger R2. The problem with this approach is that there is no penalty for adding more parameters. A model with more parameters can bend and twist to come nearer to the points, and so almost always has a higher R2. If you use R2 as the criterion for picking the best model, you would almost always pick the model with the most parameters.

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

The reason this is important is that you can "game" R-squared by adding more and more independent variables, irrespective of how well they are correlated with your dependent variable. Obviously, this is not a desirable property of a goodness-of-fit statistic. Adjusted R-squared, by contrast, adjusts the statistic such that an independent variable genuinely correlated with Y increases adjusted R-squared, while any variable without a strong correlation makes adjusted R-squared decrease. That is the desired property of a goodness-of-fit statistic.

As for which one to use: for a linear regression with more than one variable, report adjusted R-squared. For a single independent variable model, the two statistics are interchangeable.
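
A minimal sketch of this comparison in PROC REG, which reports both R-Square and Adj R-Sq for each model; the data set and variable names (work.sales, revenue, x1-x4) are hypothetical placeholders:

proc reg data=work.sales;
   small: model revenue = x1 x2;         /* champion candidate (fewer predictors) */
   large: model revenue = x1 x2 x3 x4;   /* competing model (more predictors)     */
run;
quit;

If the smaller model shows the higher Adj R-Sq, the extra predictors in the larger model are not improving the fit by more than chance alone would.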







Question : The selection criterion used in the forward selection method in the REG procedure is:
1. Adjusted R-Square
2. SLE
4. AIC

Correct Answer : 2 (SLE)

Explanation :
Criteria Used in Model-Selection Methods

When many significance tests are performed, each at a level of, for example, 5%, the overall probability of rejecting at least one true null hypothesis is much larger than 5%. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.
In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction computed using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10% to 25%.
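
A minimal sketch of forward selection with an explicit entry criterion, again using hypothetical names (work.sales, revenue, x1-x10):

proc reg data=work.sales;
   model revenue = x1-x10 / selection=forward sle=0.05;   /* a variable enters only if its p-value < .05 */
run;
quit;

Note that PROC REG's default SLE= for the FORWARD method is 0.50, so specifying a small value such as 0.05 is how the conservative strategy described above is applied in practice.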






Question : Which SAS program will correctly use backward elimination selection criterion within the REG procedure?
1. A
2. B
3. C
4. D

Explanation :
The proc reg procedure is used to perform regression analysis. Proc GLM can also be used to do this analysis by leaving the quantitative variables out of the class statement. In some ways, proc glm is superior to proc reg because proc glm allows manipulations in the model statement (such as x*x to obtain quadratic factors) which are not allowed in proc reg. However, proc reg allows certain automatic model selection features and a crude plotting feature not available in proc glm.
The variables analyzed using proc reg must be numeric variables, all of which appear in a SAS data set. If x, y, and z are three numeric variables, the basic invocation is
proc reg data=stuff;
   model z = x y;
run;
There are many options available in the model statement. As in proc glm, the options are listed after a slash (/) on the model statement line. One example is
proc reg data=stuff;
   model z = x y / noint
                   selection=stepwise
                   sle=.05
                   sls=.05;
run;
The noint option specifies that the fitted model is to have NO intercept (constant) term.
The selection= option specifies how variables are to be introduced into the model. The default (if selection= is not used) is equivalent to selection=none, in which all the variables in the model statement are used. Setting selection=stepwise introduces a variable into the model provided it is significant at the sle level and deletes a variable from the model if it is NOT significant at the sls level. Setting selection=rsquare performs all-subsets selection, listing models by their R-square values.
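
For the backward elimination the question asks about, the corresponding call (reusing the stuff, z, x, and y names from the example above, plus an extra illustrative variable w) would be:

proc reg data=stuff;
   model z = x y w / selection=backward sls=.05;   /* start with all variables, then repeatedly
                                                      drop the least significant one until every
                                                      remaining variable is significant at .05 */
run;
quit;

Backward elimination starts from the full model and removes, one at a time, any variable that fails the sls= stay criterion.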
A final important option in proc reg is the output statement. This statement, which must follow the model statement, creates a SAS data set containing the variables in the original data set together with new variables as specified in the output statement. An illustration of some of the common options is
output out=results
       predicted=pred
       residual=resid
       L95M=lowmean
       U95M=highmean
       L95=lowpred
       U95=highpred;
The out= option gives the name of the new SAS dataset.
The predicted= option gives the name of the variable in the out= data set which contains the predicted value of the dependent variable. By adding records to the original data set which specify values of the INDEPENDENT variables in the model but set the corresponding value of the DEPENDENT variable to missing, one can obtain predictions given by the model for unobserved settings of the independent variables.
The residual= option gives the name of the variable in the out= data set which contains the value of the residual.
The L95M= and U95M= options give the names of the variables in the out= data set which contain the lower and upper endpoints of a 95% confidence interval for the mean response. The L95= and U95= options give the corresponding endpoints of a 95% interval for an individual predicted value.
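
Putting the pieces together, here is a sketch of the scoring technique described under predicted= above; the toscore data set is a hypothetical placeholder containing x and y values with z left missing:

data combined;              /* stack the training rows and the rows to score */
   set stuff toscore;
run;

proc reg data=combined;
   model z = x y;           /* rows with z missing are excluded from the fit */
   output out=results
          predicted=pred    /* filled in for every row, including scored ones */
          residual=resid;   /* left missing wherever z is missing             */
run;
quit;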




Related Questions


Question : There are missing values in the input variables for a regression application.
Which SAS procedure provides a viable solution?
1. GLM
2. VARCLUS
3. STDIZE
4. CLUSTER


Question : Screening for non-linearity in binary logistic regression can be achieved by visualizing:
1. A scatter plot of binary response versus a predictor variable.
2. A trend plot of empirical logit versus a predictor variable.
3. A logistic regression plot of predicted probability values versus a predictor variable.
4. A box plot of the odds ratio values versus a predictor variable.


Question : Given the SAS data set TEST:
Which SAS program is NOT a correct way to create dummy variables?
1. A
2. B
3. C
4. D



Question : An analyst fits a logistic regression model to predict whether or not a client will default on a loan. One of the predictors in the model is agent, and each agent serves 15-20 clients. The model fails to converge. The analyst prints the summarized data, showing the number of defaulted loans per agent. See the partial output below:
What is the most likely reason that the model fails to converge?

1. There is quasi-complete separation in the data.
2. There is collinearity among the predictors.
3. There are missing values in the data.
4. There are too many observations in the data.


Question : An analyst knows that the categorical predictor, store_id, is an important predictor of the target. However, store_id has too many levels to be a feasible predictor in the model. The analyst wants to combine stores and treat them as members of the same class level.
What are the two most effective ways to address the problem?

A. Eliminate store_id as a predictor in the model because it has too many levels to be feasible.
B. Cluster by using Greenacre's method to combine stores that are similar.
C. Use subject matter expertise to combine stores that are similar.
D. Randomly combine the stores into five groups to keep the stochastic variation among the observations intact.

1. A,B
2. B,C
3. C,D
4. A,D


Question : Including redundant input variables in a regression model can:
1. Stabilize parameter estimates and increase the risk of overfitting.
2. Destabilize parameter estimates and increase the risk of overfitting.
3. Stabilize parameter estimates and decrease the risk of overfitting.
4. Destabilize parameter estimates and decrease the risk of overfitting.