
SAS Certified BI Content Developer for SAS 9 and Business Analytics Questions and Answers (Dumps and Practice Questions)



Question : An analyst has a sufficient volume of data to perform a three-way partition of the data into training, validation, and test sets to perform honest assessment during the model-building process. What is the purpose of the test data set?
1. To provide an unbiased measure of assessment for the final model.
2. To compare models and select and fine-tune the final model.
3. To reduce total sample size to make computations more efficient.
4. To build the predictive models.

Correct Answer : 1

Using Validation and Test Data: When you have sufficient data, you can subdivide your data into three parts called the training, validation, and test data. During the selection process, models are fit on the training data, and the prediction error for the models so obtained is found by using the validation data. This prediction error on the validation data can be used to decide when to terminate the selection process or to decide what effects to include as the selection process proceeds. Finally, once a selected model has been obtained, the test set can be used to assess how the selected model generalizes on data that played no role in selecting the model.

In some cases you might want to use only training and test data. For example, you might decide to use an information criterion to decide what effects to include and when to terminate the selection process. In this case no validation data are required, but test data can still be useful in assessing the predictive performance of the selected model. In other cases you might decide to use validation data during the selection process but forgo assessing the selected model on test data. Hastie, Tibshirani, and Friedman (2001) note that it is difficult to give a general rule on how many observations you should assign to each role. They note that a typical split might be 50% for training and 25% each for validation and testing.

PROC GLMSELECT provides several methods for partitioning data into training, validation, and test data. You can provide data for each role in separate data sets that you specify with the DATA=, TESTDATA=, and VALDATA= options in the PROC GLMSELECT procedure. An alternative method is to use a PARTITION statement to logically subdivide the DATA= data set into separate roles. You can name the fractions of the data that you want to reserve as test data and validation data. For example, specifying
proc glmselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   ...
run;

randomly subdivides the "inData" data set, reserving 50% for training and 25% each for validation and testing.
In some cases you might need to exercise more control over the partitioning of the input data set. You can do this by naming a variable in the input data set as well as a formatted value of that variable that corresponds to each role. For example, specifying
proc glmselect data=inData;
   partition roleVar=group(test='group 1' train='group 2');
   ...
run;

assigns roles to observations in the "inData" data set based on the value of the variable named group in that data set. Observations where the value of group is 'group 1' are assigned to testing, and those with value 'group 2' are assigned to training. All other observations are ignored.
You can also combine the use of the PARTITION statement with named data sets for specifying data roles. For example,
proc glmselect data=inData testData=inTest;
   partition fraction(validate=0.4);
   ...
run;

reserves 40% of the "inData" data set for validation and uses the remaining 60% for training. Data for testing is supplied in the "inTest" data set. Note that in this case, because you have supplied a TESTDATA= data set, you cannot reserve additional observations for testing with the PARTITION statement.
When you use a PARTITION statement, the output data set created with an OUTPUT statement contains a character variable _ROLE_ whose values "TRAIN," "TEST," and "VALIDATE" indicate the role of each observation. _ROLE_ is blank for observations that were not assigned to any of these three roles. When the input data set specified in the DATA= option in the PROC GLMSELECT statement contains a _ROLE_ variable, no PARTITION statement is used, and TESTDATA= and VALDATA= are not specified, then the _ROLE_ variable is used to define the roles of each observation. This is useful when you want to rerun PROC GLMSELECT but use the same data partitioning as in a previous PROC GLMSELECT step. For example, the following statements use the same data for testing and training in both PROC GLMSELECT steps:
proc glmselect data=inData;
   partition fraction(test=0.5);
   model y=x1-x10 / selection=forward;
   output out=outDataForward;
run;

proc glmselect data=outDataForward;
   model y=x1-x10 / selection=backward;
run;

When you have reserved observations for training, validation, and testing, a model fit on the training data is scored on the validation and test data, and the average squared error, denoted by ASE, is computed separately for each of these subsets. The ASE for each data role is the error sum of squares for observations in that role divided by the number of observations in that role.
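As a minimal sketch (inData, the response y, and the predictors x1-x10 are assumed placeholder names, and p_y is an assumed name for the predicted values), the per-role ASE can be reproduced by requesting predicted values on the OUTPUT statement and averaging the squared errors within each value of _ROLE_:

proc glmselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   model y=x1-x10 / selection=forward;
   output out=outData predicted=p_y;   /* predicted values plus the _ROLE_ variable */
run;

/* Squared error per observation, then its mean within each role (= ASE per role) */
data errors;
   set outData;
   sq_err = (y - p_y)**2;
run;

proc means data=errors mean n;
   class _ROLE_;
   var sq_err;
run;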





Question : Refer to the confusion matrix:
Calculate the sensitivity. (0 - negative outcome, 1 - positive outcome)

1. 25/48
2. 58/102
3. 25/89
4. 58/81



Correct Answer : 1

A confusion matrix compares predicted values to actual values. Classification accuracy (Acc), sensitivity (Sen), specificity (Spec), and precision (Pre) are common criteria for measuring the performance of a classification model. Sensitivity is the true positive rate: the number of correctly predicted positive outcomes (1) divided by the total number of actual positive outcomes, that is, Sensitivity = TP / (TP + FN).


Explanation: Refer to the study notes.
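The confusion matrix image is not reproduced here, so the following is only a minimal sketch of how the four measures are computed from the cells of a 2x2 confusion matrix. TP = 25 and FN = 23 are consistent with the stated correct answer of 25/48; the FP and TN counts are made-up placeholders.

data confusion_measures;
   tp = 25;   /* predicted 1, actual 1 (consistent with the answer 25/48) */
   fn = 23;   /* predicted 0, actual 1 (consistent with the answer 25/48) */
   fp = 31;   /* placeholder: predicted 1, actual 0 */
   tn = 50;   /* placeholder: predicted 0, actual 0 */
   sensitivity = tp / (tp + fn);            /* true positive rate */
   specificity = tn / (tn + fp);            /* true negative rate */
   precision   = tp / (tp + fp);            /* positive predictive value */
   accuracy    = (tp + tn) / (tp + fn + fp + tn);
run;

proc print data=confusion_measures;
run;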




Question :

The total modeling data has been split into training, validation, and test data. What is the best data to use for model assessment?
1. Training data
2. Total data
3. Test data
4. Validation data

Correct Answer : 4

During model selection, candidate models are fit on the training data and compared on the validation data, so the validation data is the appropriate set for model assessment while models are being built, compared, and fine-tuned. The test data plays no role in building or selecting the model and is held back to provide a final, unbiased measure of how the selected model generalizes. See the "Using Validation and Test Data" explanation under the first question above for details on partitioning data with PROC GLMSELECT.



Related Questions


Question : What is the default method in the LOGISTIC procedure to handle observations with missing data?
1. Missing values are imputed.
2. Parameters are estimated accounting for the missing values.
3. Parameter estimates are made on all available data.
4. Only cases with variables that are fully populated are used.


Question : Given the output from the LOGISTIC procedure:
Which variables, among those that are statistically
significant at an alpha of 0.05, have the greatest
and least relative importance on the fitted model?

1. Greatest: MBA
Least: DOWN_AMT
2. Greatest: MBA
Least: CASH
3. Greatest: DOWN_AMT
Least: CASH
4. Greatest: DOWN_AMT
Least: HOME



Question : A marketing manager attempts to
determine those customers most likely to purchase
additional products as the result of a nation-wide marketing campaign.
The manager possesses a historical dataset (CAMPAIGN)
of a similar campaign from last year. It has the following characteristics:

Target variable: Respond (0, 1)
Continuous predictor: Income
Categorical predictor: Homeowner (Y, N)

Which SAS program performs this analysis?

1. A
2. B
3. C
4. D
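The program choices A, B, C, and D are not reproduced here. As a hedged sketch only (not necessarily one of the listed options), an analysis matching the description would model the binary target with PROC LOGISTIC, place the categorical predictor on a CLASS statement, and specify which level of Respond is the event:

proc logistic data=CAMPAIGN;
   class Homeowner / param=ref;                   /* categorical predictor */
   model Respond(event='1') = Income Homeowner;   /* binary target; continuous and categorical predictors */
run;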




Question : The question will ask you to provide a missing statement. With the given program:
Which SAS statement will complete the program to correctly score the data set NEW_DATA?
1. Scoredata data=MYDIR.NEW_DATA out=scores;
2. Scoredata data=MYDIR.NEW_DATA output=scores;
3. Score data=HYDIR.NEU_DATA output=scores;
4. Score data=MYDIR,NEW DATA out=scores;
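The program referenced by the question is not reproduced here. As a hedged sketch only, the SCORE statement in PROC LOGISTIC names the data set to score with DATA= and the scored output data set with OUT= (the training data set and model variables below are assumed placeholders):

proc logistic data=MYDIR.TRAIN_DATA;
   model Respond(event='1') = Income Homeowner;
   score data=MYDIR.NEW_DATA out=scores;   /* writes predicted probabilities to WORK.SCORES */
run;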


Question : Which statistic, calculated from a validation sample, can help decide which model to use for prediction of a binary target variable?
1. Adjusted R Square
2. Mallows Cp
3. Chi Square
4. Average Squared Error




Question : Logistic regression is a model used to predict the probability of occurrence of an event.
It makes use of several predictor variables that may be ___
1. Numerical
2. Categorical
3. Both 1 and 2 are correct
4. Neither 1 nor 2 is correct