
Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : What is an example of a null hypothesis?

1. that a newly created model provides a prediction of a null sample mean
2. that a newly created model does not provide better predictions than the currently existing model
3.
4. that a newly created model provides a prediction that will be well fit to the null distribution

Correct Answer : 2

Explanation: Hypothesis testing requires constructing a statistical model of what the world would look like given that chance or random processes alone were responsible for the results. The hypothesis that chance alone is responsible for the results is called the null hypothesis. The model of the result of the random process is called the distribution under the null hypothesis. The obtained results are then compared with the distribution under the null hypothesis, and the likelihood of finding the obtained results is thereby determined.

Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true, when the study is on a randomly-selected representative sample. The null hypothesis assumes no relationship between variables in the population from which the sample is selected.

If the data-set of a randomly-selected representative sample is very unlikely relative to the null hypothesis (defined as being part of a class of sets of data that only rarely will be observed), the experimenter rejects the null hypothesis concluding it (probably) is false. This class of data-sets is usually specified via a test statistic which is designed to measure the extent of apparent departure from the null hypothesis. The procedure works by assessing whether the observed departure measured by the test statistic is larger than a value defined so that the probability of occurrence of a more extreme value is small under the null hypothesis (usually in less than either 5% or 1% of similar data-sets in which the null hypothesis does hold).

If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed data set provides no strong evidence against the null hypothesis. In this case, because the null hypothesis could be true or false, in some contexts this is interpreted as meaning that the data give insufficient evidence to make any conclusion; in other contexts it is interpreted as meaning that there is no evidence to support changing from a currently useful regime to a different one.

For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of having a heart attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment. If the data show a statistically significant change in the people receiving the drug, the null hypothesis is rejected.
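As a rough illustration of the procedure described above, here is a minimal Python sketch that compares a treated group with a control group using a two-sample t-test and rejects the null hypothesis of "no effect" when the p-value falls below 5%. The group sizes, means, and the use of scipy's ttest_ind are illustrative assumptions, not details from the question.

# Hedged sketch: synthetic outcome data for a treated and a control group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)   # outcome without the drug
treated = rng.normal(loc=9.4, scale=2.0, size=200)    # outcome with the drug (assumed small effect)

t_stat, p_value = stats.ttest_ind(treated, control)   # two-sample t-test
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis of no effect")
else:
    print(f"p = {p_value:.4f} >= {alpha}: no strong evidence against the null hypothesis")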






Question : You have fit a decision tree classifier using input variables. The resulting tree used of the
variables, and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the
model is 0.85. What is your evaluation of this model?

1. The tree did not split on all the input variables. You need a larger data set to get a more accurate model.
2. The AUC is high, and the small nodes are all very pure. This is an accurate model.
3.
4. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.


Correct Answer : 4
Explanation: AUC (Area Under the Receiver Operating Characteristic Curve): there are no universal rules of thumb for the AUC.

The AUC is the probability that a randomly sampled positive (or case) will have a higher marker value than a randomly sampled negative (or control), because the AUC is mathematically equivalent to the Mann-Whitney U statistic.
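The equivalence to the U statistic can be checked numerically. The sketch below, on synthetic scores and labels, computes the AUC with scikit-learn and compares it with the Mann-Whitney U statistic divided by the number of case-control pairs; the data and the library choices are assumptions made only for illustration.

import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)                 # 0 = control, 1 = case
scores = rng.normal(size=500) + 0.8 * y          # cases tend to score higher

auc = roc_auc_score(y, scores)

pos, neg = scores[y == 1], scores[y == 0]
u_stat, _ = mannwhitneyu(pos, neg, alternative="two-sided")
u_normalized = u_stat / (len(pos) * len(neg))    # U divided by the number of case-control pairs

print(f"AUC = {auc:.4f}, normalized U = {u_normalized:.4f}")   # the two values agree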

What the AUC is not is a standardized measure of predictive accuracy. Highly deterministic events can have single-predictor AUCs of 95% or higher (such as in controlled mechatronics, robotics, or optics), while some complex multivariable logistic risk-prediction models, such as breast cancer risk prediction, have AUCs of 64% or lower, and those are still respectably high levels of predictive accuracy.

A sensible AUC value, as with a power analysis, is prespecified by gathering knowledge of the background and aims of a study a priori. The doctor or engineer describes what they want, and you, the statistician, settle on a target AUC value for your predictive model. Then the investigation begins.

It is indeed possible to overfit a logistic regression model. Aside from linear dependence (when the model matrix is rank-deficient), you can also have perfect concordance, that is, the plot of fitted values against Y perfectly discriminates cases and controls. In that case, your parameters have not converged but simply drift toward the boundary of the parameter space, where the likelihood approaches its maximum. Sometimes, however, the AUC is 1 by random chance alone.
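A small sketch of that overfitting scenario on a toy, perfectly separable data set: as the regularization is relaxed the coefficient keeps growing (the parameters do not converge), the fitted probabilities move toward 0 and 1, and the in-sample AUC is 1. The data and the use of scikit-learn's LogisticRegression are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

x = np.arange(10, dtype=float).reshape(-1, 1)
y = (x.ravel() >= 5).astype(int)          # perfectly separable in x

for C in (1.0, 1e3, 1e6):                 # weaker and weaker regularization
    model = LogisticRegression(C=C).fit(x, y)
    prob = model.predict_proba(x)[:, 1]
    print(f"C={C:g}  coef={model.coef_[0][0]:10.2f}  in-sample AUC={roc_auc_score(y, prob):.2f}")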

There is another type of bias that arises from adding too many predictors to the model, and that is small-sample bias. In general, the log odds ratios of a logistic regression model tend toward a biased factor of 2β because of non-collapsibility of the odds ratio and zero cell counts. In inference, this is handled using conditional logistic regression to control for confounding and precision variables in stratified analyses. In prediction, however, you are out of luck: there is no generalizable prediction when the number of predictors greatly exceeds n·p(1−p) (with p = Prob(Y=1)), because at that point you are guaranteed to have modeled the "data" and not the "trend". High-dimensional (large-p) prediction of binary outcomes is better done with machine learning methods. Linear discriminant analysis, partial least squares, nearest-neighbor prediction, boosting, and random forests would be a very good place to start.
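As one possible starting point for the large-p case mentioned above, the sketch below fits a random forest to a wide, synthetic binary-outcome data set and reports a cross-validated AUC. All sizes and settings are arbitrary illustrative choices, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Wide data: many more predictors than would be safe for an unpenalized logistic model.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0)
auc_scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")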






Question : If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?

1. Line chart
2. Bar chart
3.
4. Histogram


Correct Answer : 1

Explanation: A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time - a time series - thus the line is often drawn chronologically. In these cases, such charts are known as run charts.
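A minimal sketch of such a run chart in Python, assuming matplotlib and pandas are available; the monthly sales figures are made-up illustrative values.

import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2015-01-01", periods=12, freq="MS")   # one marker per month
sales = [120, 132, 101, 134, 90, 230, 210, 182, 191, 234, 290, 330]

plt.plot(dates, sales, marker="o")   # points ordered by time, joined by straight line segments
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.title("Monthly sales trend")
plt.tight_layout()
plt.show()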



Related Questions


Question : Which of the following is not a classification algorithm?

1. Logistic Regression
2. Support Vector Machine
3.
4. Hidden Markov Models
5. None of the above




Question : Suppose a man told you he had a nice conversation with someone on the train. Not knowing anything
about this conversation, the probability that he was speaking to a woman is 50% (assuming the train had an equal
number of men and women and the speaker was as likely to strike up a conversation with a man as with a woman).
Now suppose he also told you that his conversational partner had long hair. It is now more likely he was speaking
to a woman, since women are more likely to have long hair than men. ____________ can be used to calculate
the probability that the person was a woman.
1. SVM
2. MLE
3.
4. Logistic Regression
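A worked sketch of the train-conversation example above, applying Bayes' theorem directly. The prior P(woman) = 0.5 comes from the question; the long-hair rates for women and men are assumed illustrative values, not figures given in the text.

# Prior from the question: equal numbers of men and women on the train.
p_woman = 0.5
p_man = 0.5
p_long_hair_given_woman = 0.75    # assumed illustrative value
p_long_hair_given_man = 0.15      # assumed illustrative value

# Bayes' theorem: P(woman | long hair) = P(long hair | woman) * P(woman) / P(long hair)
p_long_hair = p_long_hair_given_woman * p_woman + p_long_hair_given_man * p_man
p_woman_given_long_hair = p_long_hair_given_woman * p_woman / p_long_hair

print(f"P(woman | long hair) = {p_woman_given_long_hair:.3f}")   # about 0.833 with these numbers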




Question : True or false: Bayes' theorem cannot find the actual probability of an event from the results of your tests.

1. True
2. False




Question : You are creating a regression model with the income, education, and current debt of a customer as inputs. What could be the possible output from this model?
1. Customer fit as a good category
2. Customer fit as acceptable or average category
3.
4. 1 and 3 are correct
5. 2 and 3 are correct


Question : In which of the following scenarios can you use regression to predict values?
1. Samsung can use it for mobile sales forecast
2. Mobile companies can use it to forecast manufacturing defects
3.
4. Only 1 and 2
5. All of 1, 2 and 3



Question : You are creating a classification process where the inputs are the income, education, and current debt of a customer. What could be the possible output of this process?
1. Probability of the customer default on loan repayment
2. Percentage of the customer loan repayment capability
3.
4. The output might be a risk class, such as "good", "acceptable", "average", or "unacceptable".
5. All of the above