Question: A predictive model uses a data set that has several variables with missing values. What two problems can arise with this model?
A. The model will likely be overfit.
B. There will be a high rate of collinearity among input variables.
C. Complete case analysis means that fewer observations will be used in the model building process.
D. New cases with missing values on input variables cannot be scored without extra data processing.
Explanation: Many of the input variables in the Donor data set that you have been using have missing values. If an observation contains a missing value, then by default that observation is not used for modeling by nodes such as Variable Selection, Neural Network, or Regression. Depending on the type of predictive model that you build, missing values can cause problems. If your model is based on a decision tree, missing values cause no problems because decision trees handle missing values directly. However, in Enterprise Miner, regression and neural network models ignore observations that contain missing values. Substantially reducing the size of the training data set can weaken these predictive models, so it is wise to impute missing values before you fit a regression model or a neural network model. When you replace missing values with imputed ones, the regression and neural network algorithms can use the entire training data set. If you do not impute missing values for these models, the reduced sample might result in an inferior model. Additionally, it is important to impute missing values if you plan to compare a regression or neural network model with a decision tree model, because it is more appropriate to compare models that are built on the same set of observations.
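The effect described above can be sketched outside Enterprise Miner. The snippet below (a minimal illustration with a made-up toy table, not the real Donor data) shows how complete-case analysis discards rows with any missing value, and how a simple mean imputation keeps every observation available for modeling:

```python
# Toy illustration: complete-case analysis vs. mean imputation.
# Column names and values are invented for this example.
import pandas as pd

donors = pd.DataFrame({
    "age":        [34, None, 51, 29, None, 62],
    "income":     [40_000, 55_000, None, 38_000, 47_000, 90_000],
    "gift_count": [2, 5, 3, 1, 4, 7],
})

# Complete-case analysis: any row with a missing value is dropped,
# which is what regression/neural-network nodes do by default.
complete = donors.dropna()
print(len(donors), "rows total,", len(complete), "usable without imputation")

# Mean imputation fills the gaps so the full training set can be used.
imputed = donors.fillna(donors.mean())
print(len(imputed.dropna()), "rows usable after imputation")
```

Here half the rows would be lost to complete-case analysis; on a real training set with many partially-missing variables the loss can be much larger, which is why imputation matters for these model types.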
Question: Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between which function of the input variables?
1. Concordant and discordant pairs of ranked observations
2. Logit link (log(p/(1-p)))
3. Access Mostly Uused Products by 50000+ Subscribers
4. Weighted sum of chi-square statistics for 2x2 tables
Explanation: The SPEARMAN option requests a table of Spearman rank-order correlation coefficients, which are based on the ranks of the variables. The coefficients range from -1 to +1. If you specify a WEIGHT statement, the SPEARMAN option is invalid. PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In the case of ties, averaged ranks are used.
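The rank-then-Pearson computation described above can be verified directly. This sketch uses scipy (not PROC CORR) and made-up values, including a tie to show the averaged-rank handling:

```python
# Sketch: Spearman correlation equals the Pearson correlation of the ranks,
# with tied values receiving averaged ranks. Data are invented.
from scipy.stats import pearsonr, rankdata, spearmanr

x = [10, 20, 20, 40, 55]   # note the tie at 20
y = [1, 3, 2, 5, 4]

rho_direct, _ = spearmanr(x, y)

# Rank each variable (ties get averaged ranks), then apply Pearson's formula.
rho_via_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(abs(rho_direct - rho_via_ranks) < 1e-12)
```

Because the statistic depends only on ranks, it captures any monotone association between an input and the target, which is what makes it useful for screening out irrelevant inputs.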
Question: A non-contributing predictor variable (Pr > |t| = .) is added to an existing multiple linear regression model. What will be the result?
1. An increase in R-Square
2. A decrease in R-Square
3. Access Mostly Uused Products by 50000+ Subscribers
4. No change in R-Square
R-squared is a statistic that gives some information about the goodness of fit of a model. In regression, the R-squared coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R-squared of 1.0 indicates that the regression line perfectly fits the data.
In some (but not all) instances where R-squared is used, the predictors are calculated by ordinary least-squares regression: that is, by minimizing SSE. In this case R-squared increases as we increase the number of variables in the model (R-squared will not decrease). This illustrates a drawback to one possible use of R-squared, where one might try to include more variables in the model until "there is no more improvement". This leads to the alternative approach of looking at the adjusted R-squared. The explanation of this statistic is almost the same as for R-squared, but it penalizes the statistic as extra variables are included in the model. For cases other than fitting by ordinary least squares, the R-squared statistic can be calculated as above and may still be a useful measure. However, the conclusion that R-squared increases with extra variables no longer holds, although downward variations are usually small. If fitting is by weighted least squares or generalized least squares, alternative versions of R-squared can be calculated appropriate to those statistical frameworks, while the "raw" R-squared may still be useful if it is more easily interpreted. Values for R-squared can be calculated for any type of predictive model, which need not have a statistical basis.
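The non-decreasing property of R-squared under OLS, and the special case of a perfectly non-contributing (linearly dependent) predictor, can be demonstrated numerically. This sketch uses synthetic data invented for the example, not the Donor set:

```python
# Sketch: under OLS, adding a predictor never lowers R-squared, and a
# perfectly collinear predictor (the "Pr > |t| = ." case) leaves it
# exactly unchanged. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)

def r_squared(cols, y):
    """R-squared of an OLS fit with intercept on the given predictor columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / sst

noise = rng.normal(size=n)          # irrelevant extra predictor
x_dup = 2.0 * x1                    # linearly dependent predictor

r2_small = r_squared([x1], y)
r2_big = r_squared([x1, noise], y)  # can only stay equal or rise
r2_dup = r_squared([x1, x_dup], y)  # non-contributing: no change at all

print(r2_big >= r2_small, abs(r2_dup - r2_small) < 1e-9)  # → True True
```

The duplicated column adds nothing to the column space of the design matrix, so the fitted values and hence R-squared are identical, which is why "No change in R-Square" is the expected outcome for a Pr > |t| = . predictor.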
1. More high value customers are found in some regions than others.
2. The difference between average purchases for medium and high value customers depends on the region.
3. Regions with higher average purchases have more high value customers.
4. Regions with higher average purchases have more medium value customers.