
Dell EMC Data Science and Big Data Certification Questions and Answers



Question : What is an example of a null hypothesis?

1. that a newly created model provides a prediction of a null sample mean
2. that a newly created model does not provide better predictions than the currently existing model
3. …
4. that a newly created model provides a prediction that will be well fit to the null distribution

Correct Answer : 2

Explanation: Hypothesis testing requires constructing a statistical model of what the world would look like given that chance or random processes alone were responsible for the results. The hypothesis that chance alone is
responsible for the results is called the null hypothesis. The model of the result of the random process is called the distribution under the null hypothesis. The obtained results are then compared with the
distribution under the null hypothesis, and the likelihood of finding the obtained results is thereby determined.[3]
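
For illustration, here is a minimal sketch of comparing an observed result with the distribution under the null hypothesis, using a fair coin as the "chance alone" model (the coin scenario and the numbers are our own, not part of the question):

from scipy import stats

# Null hypothesis: the coin is fair, P(heads) = 0.5, so the result of the
# random process follows a Binomial(100, 0.5) distribution.
n, observed_heads = 100, 60

# How likely is a result at least this extreme under the null distribution?
result = stats.binomtest(observed_heads, n=n, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.4f}")  # roughly 0.057: borderline at the 5% level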

Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true, when the study is on a randomly-selected representative sample. The null
hypothesis assumes no relationship between variables in the population from which the sample is selected.

If the data-set of a randomly-selected representative sample is very unlikely relative to the null hypothesis (defined as being part of a class of sets of data that only rarely will be observed), the experimenter
rejects the null hypothesis concluding it (probably) is false. This class of data-sets is usually specified via a test statistic which is designed to measure the extent of apparent departure from the null hypothesis.
The procedure works by assessing whether the observed departure measured by the test statistic is larger than a value defined so that the probability of occurrence of a more extreme value is small under the null
hypothesis (usually in less than either 5% or 1% of similar data-sets in which the null hypothesis does hold).

If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed data set provides no strong evidence against the null hypothesis. In this case, because the null
hypothesis could be true or false, in some contexts this is interpreted as meaning that the data give insufficient evidence to make any conclusion; in other contexts it is interpreted as meaning that there is no
evidence to support changing from a currently useful regime to a different one.

For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of
having a heart attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment. If the data show a statistically significant change in the
people receiving the drug, the null hypothesis is rejected.
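
The drug example maps directly onto a test of independence on a 2x2 table. A sketch with made-up trial counts (the numbers below are purely illustrative):

import numpy as np
from scipy import stats

#                  heart attack  no heart attack
table = np.array([[30, 470],    # drug group
                  [55, 445]])   # control group

# Chi-square test of independence; the null hypothesis is that the drug
# has no effect on the chance of a heart attack.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 would lead us to reject the null hypothesis
# at the 5% significance level.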






Question : You have fit a decision tree classifier using a number of input variables. The resulting tree used only a subset of the
variables and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the
model is 0.85. What is your evaluation of this model?

1. The tree did not split on all the input variables. You need a larger data set to get a more accurate model.
2. The AUC is high, and the small nodes are all very pure. This is an accurate model.
3. …
4. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.


Correct Answer : 4
Explanation: AUC (Area Under the Receiver Operating Characteristic Curve): there are no universal rules of thumb for what counts as a good AUC.

What the AUC is: the probability that a randomly sampled positive (or case) will have a higher marker value than a randomly sampled negative (or control), because the AUC is mathematically equivalent to the Mann-Whitney U statistic.
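
That equivalence is easy to check numerically. A small sketch (synthetic marker values, our own choice of distributions) comparing scikit-learn's AUC with the scaled Mann-Whitney U statistic:

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 200)  # marker values for cases
neg = rng.normal(0.0, 1.0, 300)  # marker values for controls

y = np.r_[np.ones(200), np.zeros(300)]
scores = np.r_[pos, neg]

# AUC from the ROC curve ...
auc = roc_auc_score(y, scores)

# ... equals U / (n_pos * n_neg): the probability that a randomly sampled
# case has a higher marker value than a randomly sampled control.
u, _ = mannwhitneyu(pos, neg)
print(auc, u / (len(pos) * len(neg)))  # the two numbers match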

What the AUC is not: a standardized measure of predictive accuracy. Highly deterministic processes can have single-predictor AUCs of 95% or higher (as in controlled mechatronics, robotics, or optics), while some complex
multivariable logistic risk prediction models, such as breast cancer risk prediction, have AUCs of 64% or lower, and those are respectably high levels of predictive accuracy.

A sensible AUC value, as with a power analysis, is prespecified by gathering knowledge of the background and aims of a study a priori. The doctor or engineer describes what they want, and you, the statistician, settle on
a target AUC value for your predictive model. Then begins the investigation.

It is indeed possible to overfit a logistic regression model. Aside from linear dependence (when the model matrix is of deficient rank), you can also have perfect concordance, that is, the plot of fitted values
against Y perfectly discriminates cases and controls. In that case, your parameters have not converged but simply reside somewhere on the boundary space that gives a likelihood of 1. Sometimes, however, the AUC is 1
by random chance alone.

There is another type of bias that arises from adding too many predictors to the model: small-sample bias. In general, the log odds ratios of a logistic regression model tend toward a biased factor of 2β
because of non-collapsibility of the odds ratio and zero cell counts. In inference, this is handled using conditional logistic regression to control for confounding and precision variables in stratified analyses.
In prediction, however, you are out of luck. There is no generalizable prediction when the number of predictors far exceeds n·p(1−p) (where p = Prob(Y = 1)), because at that point you are guaranteed to have modeled
the "data" and not the "trend". High-dimensional (large-p) prediction of binary outcomes is better done with machine learning methods. Linear discriminant analysis, partial least squares, nearest-neighbor prediction,
boosting, and random forests would be a very good place to start.
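
To connect this back to the question itself, here is a sketch (scikit-learn, synthetic data, illustrative parameters) of how a tree with very small nodes can still rank reasonably well while its probability estimates collapse to 0 or 1, i.e. the model is not well-calibrated:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaf in (1, 50):  # tiny leaves vs. reasonably sized leaves
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    p = tree.predict_proba(X_te)[:, 1]
    extreme = np.mean((p == 0) | (p == 1))  # leaf estimates of exactly 0 or 1
    print(f"min_samples_leaf={leaf}: AUC={roc_auc_score(y_te, p):.3f}, "
          f"Brier={brier_score_loss(y_te, p):.3f}, extreme probabilities={extreme:.0%}")

The Brier score penalizes overconfident, miscalibrated probability estimates, which is exactly what the 3-point nodes in the question would produce.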






Question : If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?

1. Line chart
2. Bar chart
3. …
4. Histogram


Correct Answer : 1

Explanation: A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. It is
similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals
of time (a time series), so the line is often drawn chronologically; in these cases it is known as a run chart.
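
A minimal sketch of such a chart (the series below is invented for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Monthly time series; points are ordered chronologically along the x-axis.
dates = pd.date_range("2023-01-01", periods=12, freq="MS")
sales = np.array([10, 12, 11, 14, 15, 17, 16, 19, 21, 20, 23, 25])

# Markers joined by straight line segments make the trend easy to read.
plt.plot(dates, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.title("Monthly sales trend")
plt.show()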



Related Questions


Question : What are the key outcomes of a successful analytics project?
A. Code of the model
B. Technical specifications
C. Presentations for the Analysts
D. Presentation for Project Sponsors

1. A,B
2. B,C
3. A,C,D
4. B,C,D
5. A,B,C,D


Question : Order all the steps correctly that you would follow while implementing an advanced analytics data science project.
A. Discovery
B. Data Preparations
C. Creating Models
D. Executing Models
E. Creating visuals from the outcome
F. Operationalize the models

1. A,,D,E,F
2. C,D,E,F
3. A,B,C,D
4. B,C,D,E
5. A,B,C,D,E


Question : You are working on a data science project, and during the project you have been given the responsibility of interviewing all the stakeholders. In which phase of the project are you?



1. Discovery

2. Data Preparations

3. Creating Models

4. Executing Models

5. Creating visuals from the outcome
6. Operationalize the models



Question : You are working as a data scientist in a retail chain company. You and your team have been given a project to implement recommendation engines for the products the company sells online, and you
have decided to create an analytics sandbox. Which of the following are you trying to achieve?


1. You are creating a Hive table in the Hadoop framework.

2. You are defining the SQL queries for extracting the data.

3. You are estimating the size of the datasets and planning for 5 to 10 times that size in total storage.

4. You would be transforming your semi-structured data into well-formatted data and saving it to a CSV file.

5. You are selecting the advanced analytics model.



Question : You are working with a training company that provides online training in various professions. You have received data for further analysis that has already been transformed and structured. You find that
there is a high correlation between course category, courses watched, and number of hours of training watched. You need a technique to handle these highly correlated variables; which of the following will you use?


1. You will take the square root of each variable so that the correlation is removed.

2. You will discard all three variables.

3. You will use a normalizing technique so that the three variables become equal in scale.

4. You will create a new variable which is a function of these three correlated variables.



Question : You are doing advanced analytics for a medical application using regression. Weight and height are very important input variables that cannot be ignored, but they are also highly
correlated. What is the best solution?


1. You will take the cube root of height.

2. You will take the square root of weight.

3. You will take the square of height.

4. You would consider using BMI (Body Mass Index).
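
If BMI is the chosen answer, the fix amounts to replacing two highly correlated inputs with one derived feature. A small pandas sketch (column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({"weight_kg": [70, 85, 60],
                   "height_m": [1.75, 1.80, 1.62]})

# BMI = weight (kg) / height (m)^2, the standard definition, combines
# the two correlated variables into a single input for the regression.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)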