Question : Select the choice for which regression algorithms are not the best fit 1. The dimensions of an object are given 2. The weight of a person is given 3. The temperature in the atmosphere is given 4. Employee status
Correct Answer : 4
Explanation: Regression algorithms are usually employed when the target is an inherently numerical variable (such as the dimensions of an object, the weight of a person, or the temperature in the atmosphere). Unlike Bayesian classification algorithms, they are not well suited to categorical targets (such as employee status or a credit score description).
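To make the distinction concrete, here is a minimal sketch (using scikit-learn, with invented toy data; the feature, values, and labels are illustrative only) of fitting a regression model to a numeric target such as weight and a classifier to a categorical target such as employee status.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: one numeric feature (e.g., height in cm) -- values are made up.
X = np.array([[150], [160], [170], [180], [190]])

# Numeric target (e.g., weight in kg): a regression algorithm is a natural fit.
weight = np.array([55.0, 62.0, 70.0, 78.0, 85.0])
reg = LinearRegression().fit(X, weight)
print(reg.predict([[175]]))   # predicts a continuous value

# Categorical target (e.g., employee status): a classifier is the better fit.
status = np.array(["active", "active", "resigned", "active", "resigned"])
clf = LogisticRegression().fit(X, status)
print(clf.predict([[175]]))   # predicts a class label
```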
Question : Logistic regression does not work well in the case of binary classification 1. True 2. False
Correct Answer : 2
Explanation: In logistic regression, the model (the logistic function) takes values between 0 and 1, which can be interpreted as the probability of class membership, so it works well for binary classification.
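As a minimal sketch (scikit-learn, with toy data invented for illustration), the fitted logistic model returns class-membership probabilities between 0 and 1, which are then thresholded (typically at 0.5) to make the binary decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data (one feature); values are made up.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The logistic (sigmoid) function maps the linear score into (0, 1),
# interpretable as P(y = 1 | x).
probs = model.predict_proba([[2.0], [3.8]])[:, 1]
print(probs)                          # probabilities strictly between 0 and 1

# Thresholding the probability at 0.5 gives the binary class prediction.
print((probs >= 0.5).astype(int))
```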
Question : Refer to the ROC curve: As you move along the curve, what changes? 1. The priors in the population 2. The true negative rate in the population 3. The proportion of events in the training data 4. The probability cutoff for scoring
Correct Answer : 4
Explanation: Moving along the ROC curve corresponds to changing the probability cutoff used to score cases. Using predicted probabilities for the ROC tells you how well the linear combination of indicator variables distinguishes a case from a non-case; however, as far as I understand, this by itself will not help much in determining cut-off scores in terms of raw scores. The method I followed is detailed below.

First, I selected the indicators that worked significantly better than the others by examining the significance of the difference in their AUCs (using Sigma Plot). After selecting the significantly better indicators, I conducted an exploratory factor analysis to see whether the indicators could be pooled into fewer yet meaningful factors. The obtained factor structure was then tested with a confirmatory factor analysis (using AMOS); if a good fit was noted, two approaches were followed to obtain a single score representing the linear combination of the variables forming a given factor. The first approach used the latent factor score (the imputed-score option in AMOS), and the second used the aggregate score (i.e., the sum of the variables constituting the given factor). The ROC analysis was conducted on both scores and the AUCs were compared for significance of difference. Further, the AUCs of the composite scores were also compared with the AUCs of the individual indicators to establish that the linear combination does better than any single indicator. If the combination score performed better than any individual indicator, the cutoff for the composite score was computed using the Youden index as well as the intersection of sensitivity and specificity (Sigma Plot for the computation and SPSS for plotting, although any software, including Excel, would do).

The approach above tells you whether a combination of indicators performs better (or not) than any single indicator in making a diagnosis. It does not tell you whether combining the individual cut-off scores of several indicators yields a better diagnosis than any indicator alone. For that purpose, MedCalc (a software package) can help; we did not use it in our own research because the sample size was not large enough. The general approach in MedCalc is to conduct the ROC analysis for the single best indicator and determine its cut-off score, then filter the cases using this cutoff (i.e., select the cases with a score at or above the cutoff), add the next best indicator, perform the ROC analysis again, and determine the cutoff for this second indicator. The sensitivity and specificity associated with the second indicator (obtained from the cases at or above the cutoff on the first indicator) are in fact the sensitivity and specificity for the combination of the two cutoff scores. For instance, if the analysis using indicator X yields equal sensitivity and specificity at a cutoff of, say, 10 or higher, one selects the cases with a score of 10 or higher on X and then performs the ROC analysis on this sub-sample using another indicator, say Y.
If that analysis reveals that a cutoff score of 12 on Y gives sensitivity and specificity higher than X alone (say, sensitivity and specificity both equal to 93%), one may conclude that a score of 10 or higher on X combined with a score of 12 or higher on Y gives better diagnostic accuracy (with sensitivity and specificity of 93%) than X or Y alone. To support this conclusion, a comparison of the AUCs of X, Y, and the combination of X and Y is required.

The underlying point is that the ROC curve is a plot whose points are calculated from the counts in the confusion matrix at a given model-score cut-off. If you take the output of ctable pprob=0.1 to 1 by 0.1, you have the TN, TP, FN, and FP counts that let you calculate the x and y coordinates on the ROC curve for 10 different probability cut-offs. What you then need to know is the cost matrix associated with TN, TP, FN, and FP, so that you can decide where the optimal cut-off lies for your particular problem. Using the curve this way requires understanding what an ROC curve represents and how to use a risk score generated by a logistic regression (or indeed by any model; it does not have to be a statistical model).

Sensitivity and specificity: The whole point of an ROC curve is to help you decide where to draw the line between 'normal' and 'not normal'. This is an easy decision if all the control values are higher (or lower) than all the patient values; usually, however, the two distributions overlap, making it not so easy. If you set the threshold high, you won't mistakenly diagnose the disease in many who don't have it, but you will miss some of the people who do have it. If you set the threshold low, you'll correctly identify all (or almost all) of the people with the disease, but you will also diagnose the disease in more people who don't have it. To help you make this decision, Prism tabulates and plots the sensitivity and specificity of the test at various cut-off values. Sensitivity is the fraction of people with the disease that the test correctly identifies as positive; specificity is the fraction of people without the disease that the test correctly identifies as negative. Prism calculates sensitivity and specificity using each value in the data table as the cutoff value, yielding many sensitivity-specificity pairs. A high threshold increases the specificity of the test but loses sensitivity; a low threshold increases sensitivity but loses specificity. Prism displays these results in two forms: the table labeled "ROC curve" is used to create the graph of 100% - Specificity% vs. Sensitivity%, and the table labeled "Sensitivity and Specificity" tabulates those values along with their 95% confidence intervals for each possible cutoff between normal and abnormal.

Area: The area under an ROC curve quantifies the overall ability of the test to discriminate between individuals with the disease and those without it. A truly useless test (one no better at identifying true positives than flipping a coin) has an area of 0.5; a perfect test (one with zero false positives and zero false negatives) has an area of 1.00. Your test will have an area between those two values. Even if you choose to plot the results as percentages, Prism reports the area as a fraction. Prism computes the area under the entire ROC curve, starting at (0, 0) and ending at (100, 100).
Note that whether or not you ask Prism to plot the ROC curve out to these extremes, it computes the area for the entire curve. While it is clear that the area under the curve is related to the overall ability of a test to correctly identify normal versus abnormal, it is not so obvious how to interpret the area itself. There is, however, a very intuitive interpretation. If patients have higher test values than controls, the area represents the probability that a randomly selected patient will have a higher test result than a randomly selected control. If patients tend to have lower test results than controls, the area represents the probability that a randomly selected patient will have a lower test result than a randomly selected control.
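The following is a minimal sketch (Python with NumPy and scikit-learn; the scores and labels are invented for illustration) of the ideas above: sensitivity and specificity computed from confusion-matrix counts at a grid of probability cut-offs, a Youden-index cut-off, the area under the ROC curve, and the interpretation of the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Invented example: model risk scores for 100 non-cases and 100 cases,
# with cases tending to score higher.
scores_neg = rng.normal(0.35, 0.15, 100)   # controls / non-events
scores_pos = rng.normal(0.65, 0.15, 100)   # patients / events
scores = np.concatenate([scores_neg, scores_pos])
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Sensitivity and specificity at each cut-off from the confusion-matrix
# counts, analogous to ctable pprob=0.1 to 1 by 0.1.
print("cutoff  sens   spec")
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (scores >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fn = np.sum((pred == 0) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    fp = np.sum((pred == 1) & (labels == 0))
    sens = tp / (tp + fn)          # true positive rate
    spec = tn / (tn + fp)          # true negative rate
    print(f"{cutoff:.1f}    {sens:.2f}   {spec:.2f}")

# Youden index J = sensitivity + specificity - 1; the cut-off maximizing J
# is one common choice of "optimal" threshold.
cutoffs = np.sort(np.unique(scores))
best = max(cutoffs, key=lambda c: (
    np.mean(scores[labels == 1] >= c) + np.mean(scores[labels == 0] < c) - 1))
print("Youden-optimal cutoff:", round(best, 3))

# AUC, and its probabilistic interpretation: the chance that a randomly
# selected case has a higher score than a randomly selected control.
print("AUC:", roc_auc_score(labels, scores))
pairs = scores_pos[:, None] > scores_neg[None, :]
ties = scores_pos[:, None] == scores_neg[None, :]
print("P(random case > random control):", (pairs + 0.5 * ties).mean())
```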
Question : Refer to the lift chart: At a depth of 0.1, Lift = 3.14. What does this mean? 1. Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%. 2. Selecting the observations with a response probability of at least 10% should result in 3.14 times more events than a random draw of 10%. 3. Selecting the top 10% of the population scored by the model should result in 3.14 times greater accuracy than a random draw of 10%. 4. Selecting the observations with a response probability of at least 10% should result in 3.14 times greater accuracy than a random draw of 10%.
Correct Answer : 1
Explanation: Lift at a given depth compares the events captured in the model-selected group with the events expected from a random draw of the same size, so a lift of 3.14 at a depth of 0.1 means the top 10% of the population as scored by the model contains 3.14 times as many events as a random 10% sample would.
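As a minimal sketch (Python, with invented scores and outcomes; the lift value produced depends entirely on this made-up data), lift at depth 0.1 can be computed by comparing the event rate in the top 10% of model-scored observations with the overall event rate of a random draw.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 10,000 observations with model scores and binary outcomes,
# where higher scores are more likely to be events (overall rate ~10%).
n = 10_000
scores = rng.uniform(0, 1, n)
events = rng.uniform(0, 1, n) < scores * 0.2

depth = 0.1
top_k = int(n * depth)
top_idx = np.argsort(-scores)[:top_k]     # top 10% by model score

event_rate_top = events[top_idx].mean()   # event rate in the top decile
event_rate_all = events.mean()            # event rate of a random draw

# Lift at depth 0.1: how many times more events the model's top 10%
# contains compared with a random 10% of the population.
print("lift at depth 0.1:", event_rate_top / event_rate_all)
```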
Question : Refer to the lift chart: What does the reference line at lift = 1 correspond to? 1. The predicted lift for the best 50% of validation data cases 2. The predicted lift if the entire population is scored as event cases 3. The predicted lift if none of the population is scored as event cases 4. The predicted lift if 50% of the population is randomly scored as event cases