Question : When mean imputation is performed on data after the data is partitioned for honest assessment, what is the most appropriate method for handling the mean imputation?
1. The sample means from the validation data set are applied to the training and test data sets.
2. The sample means from the training data set are applied to the validation and test data sets.
3. The sample means from the test data set are applied to the training and validation data sets.
4. The sample means from each partition of the data are applied to their own partition.
Correct Answer : 2
Explanation: The key point for honest assessment is that imputation statistics must be computed on the training data only and then applied to the validation and test data sets; computing the means on a holdout partition would leak information from that partition into model building and invalidate the assessment. Beyond that, mean imputation (also called mean substitution) really ought to be a last resort in most situations. There are many alternatives to mean imputation that provide much more accurate parameter estimates, so there is rarely an excuse to use it. First, a definition: mean imputation is the replacement of a missing observation with the mean of the non-missing observations for that variable. Problem #1: Mean imputation does not preserve the relationships among variables. True, imputing the mean preserves the mean of the observed data, so if the data are missing at random, the estimate of the mean remains unbiased. This is the original logic behind mean imputation. If all you are doing is estimating means (which is rarely the point of research studies), mean imputation will not bias your parameter estimate, although it will bias your standard error. Since most research studies are interested in the relationships among variables, mean imputation is not a good solution. We're both users of multiple imputation for missing data. We believe it is the most practical principled method for incorporating the most information into data analysis. In fact, one of our more successful collaborations is a review of software for multiple imputation.
But, for me at least, there are times when a simpler form of imputation may be useful. For example, it may be desirable to calculate the mean of the observed values and substitute it for any missing values. Typically it would be unwise to attempt to use a data set completed in this way for formal inference, but it could be convenient under deadline pressure or for a very informal overview of the data.
Nick disagrees. He finds it hard to imagine any setting in which he would ever use such a primitive approach. He passes on to the reader the sage advice he received in graduate school: that making up data in such an ad-hoc fashion might be construed as dereliction or even misconduct. Use of single imputation approaches (which yield bias in many settings and attenuate estimates of variance) seems hard to justify in 2014. But one of the hallmarks of our partnership is that we can agree to disagree on an absolute ban, while jointly advising the reader to proceed with great caution.
SAS
In SAS, it would be possible to approach this using proc means to find the means and then add them back into the data set in a data step. But there is a simpler way, using proc standard.
proc standard data=indata out=outdata replace;
run;

This will replace missing values of all numeric variables in the indata data set with the mean of the observed values and save the result in a new data set, outdata. To restrict the operation to specific variables, use a var statement.
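Note that proc standard computes the means from whatever data set it is given, so by itself it does not enforce the partition-aware rule in the question above. One way to compute the means on the training partition and apply them to the other partitions is proc stdize with the reponly option, saving the training statistics with outstat= and reading them back with method=in(). The sketch below is only illustrative; the data set names (train, valid, test, trainstats) and variable names (x1, x2) are placeholders.

/* Compute means on the training data only, replace its missing values,
   and save the location statistics for reuse. */
proc stdize data=train out=train_imputed outstat=trainstats
            reponly method=mean;
   var x1 x2;
run;

/* Apply the saved training means to the validation and test partitions,
   again replacing missing values only. */
proc stdize data=valid out=valid_imputed reponly method=in(trainstats);
   var x1 x2;
run;

proc stdize data=test out=test_imputed reponly method=in(trainstats);
   var x1 x2;
run;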
Question : An analyst generates a model using the LOGISTIC procedure. They are now interested in getting the sensitivity and specificity statistics on a validation data set for a variety of cutoff values. Which statement and option combination will generate these statistics?
1. score data=valid1 out=roc;
2. score data=valid1 outroc=roc;
3. model resp(event='1') = gender region / outroc=roc;
4. model resp(event='1') = gender region / out=roc;
Correct Answer : 2
Explanation: On the SCORE statement, OUTROC= names a SAS data set that contains the data necessary for producing the ROC curve, computed from the scored data set. The ROC curve is computed only for binary response data. The SCORE statement creates a data set that contains all the data in the DATA= data set together with posterior probabilities and, optionally, prediction confidence intervals. Fit statistics are displayed on request. If you have binary response data, the SCORE statement can be used to create a data set containing data for the ROC curve. You can specify several SCORE statements.
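As a sketch of how the pieces fit together (the data set and variable names here, such as train, valid1, resp, gender, and region, are taken from the answer options or are illustrative):

proc logistic data=train;
   class gender region / param=ref;
   model resp(event='1') = gender region;
   /* Score the validation data; OUTROC= writes one row per cutoff with
      _PROB_, _SENSIT_ (sensitivity), and _1MSPEC_ (1 - specificity). */
   score data=valid1 out=scored outroc=roc;
run;

Specificity at each cutoff can then be read off as 1 minus the _1MSPEC_ column of the roc data set.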
Question : In partitioning data for model assessment, which sampling methods are acceptable?
A. Simple random sampling without replacement B. Simple random sampling with replacement C. Stratified random sampling without replacement D. Sequential random sampling with replacement
1. A,B 2. B,C 3. A,D 4. A,C 5. A,B,C
Correct Answer : 4
Explanation: Simple Random Sampling: A simple random sample (SRS) of size n is produced by a scheme which ensures that each subgroup of the population of size n has an equal probability of being chosen as the sample.
Stratified Random Sampling: Divide the population into "strata". There can be any number of these. Then choose a simple random sample from each stratum and combine those into the overall sample. That is a stratified random sample. (Example: Church A has 600 women and 400 men as members. One way to get a stratified random sample of size 30 is to take an SRS of 18 women from the 600 women and another SRS of 12 men from the 400 men.)
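In the model-assessment setting, both acceptable methods can be carried out with proc surveyselect, which draws a simple random sample without replacement by default (method=srs) and becomes stratified as soon as a strata statement is added. The sketch below is illustrative only; the data set name mydata, the stratification variable target, and the 70% sampling rate are placeholders.

proc surveyselect data=mydata out=partitioned
                  method=srs samprate=0.7 seed=12345 outall;
   /* Stratify on the target so each partition keeps the same event
      proportion (the input data must be sorted by target first);
      remove this statement for plain simple random sampling. */
   strata target;
run;

With the outall option every observation is kept and a Selected flag is added (1 = sampled for training, 0 = holdout), which makes it easy to split the data into partitions afterwards.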
Multi-Stage Sampling: Sometimes the population is too large and scattered for it to be practical to make a list of the entire population from which to draw a SRS. For instance, when a polling organization samples US voters, they do not do a SRS. Since voter lists are compiled by counties, they might first do a sample of the counties and then sample within the selected counties. This illustrates two stages. In some instances, they might use even more stages. At each stage, they might do a stratified random sample on sex, race, income level, or any other useful variable on which they could get information before sampling.
How does one decide which type of sampling to use?
The formulas in almost all statistics books assume simple random sampling. Unless you are willing to learn the more complex techniques needed to analyze data collected under other designs, it is appropriate to use simple random sampling. To learn the appropriate formulas for the more complex sampling schemes, look for a book or course on sampling.
Stratified random sampling gives more precise information than simple random sampling for a given sample size. So, if information on all members of the population is available that divides them into strata that seem relevant, stratified sampling will usually be used.
If the population is large and enough resources are available, usually one will use multi-stage sampling. In such situations, usually stratified sampling will be done at some stages.
How do we analyze the results differently depending on the different type of sampling?
The main difference is in the computation of the estimates of the variance (or standard deviation). An excellent book for self-study is A Sampler on Sampling, by Williams (Wiley). In it, you see a rather small population and then a complete derivation and description of the sampling distribution of the sample mean for a particular small sample size. I believe that is accessible for any student who has had an upper-division mathematical statistics course and for some strong students who have had a freshman introductory statistics course. A very simple statement of the conclusion is that the variance of the estimator is smaller if it came from a stratified random sample than from a simple random sample of the same size. Since a smaller variance means more precise information from the sample, this is consistent with stratified random sampling giving better estimators for a given sample size.
Question : RMSE measures the error of a predicted
1. Numerical value
2. Categorical value
3. Both numerical and categorical values
4. None of the above
Correct Answer : 1
Explanation: RMSE (root mean square error) summarizes how far numeric predictions fall from the actual numeric values, so it applies only to numerical targets; the accuracy of categorical predictions is assessed with measures such as the misclassification rate.
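For reference, with n scored observations, actual values $y_i$, and predicted values $\hat{y}_i$, the statistic is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

so it is only defined when the difference $y_i - \hat{y}_i$ is a meaningful number, that is, for numerical targets.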