
Dell EMC Data Science and Big Data Certification Questions and Answers



Question : Which word or phrase completes the statement? A data warehouse is to a centralized database
for reporting as an analytic sandbox is to a _______?
1. Collection of data assets for modeling

2. Collection of low-volume databases
3. (option not shown)

4. Collection of data assets for ETL


Correct Answer :

Explanation: A data warehouse maintains a copy of information from the source transaction systems. This architecture provides the opportunity to:
Congregate data from multiple sources into a single database, so a single query engine can be used to present data.
Mitigate the problem of database isolation-level lock contention in transaction processing systems, caused by attempts to run large, long-running analysis queries against those systems.
Maintain data history, even if the source transaction systems do not.
Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
Improve data quality by providing consistent codes and descriptions, and by flagging or even fixing bad data.
Present the organization's information consistently.
Provide a single common data model for all data of interest, regardless of the data's source.
Restructure the data so that it makes sense to the business users.
Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
Add value to operational business applications, notably customer relationship management (CRM) systems.
Make decision-support queries easier to write.

By contrast, as a centralized data container in a purpose-built space, a data warehouse supports BI and reporting but restricts robust analyses: analysts depend on IT and DBAs for data access and schema changes, and must spend significant time obtaining aggregated and disaggregated data extracts from multiple sources. An analytic sandbox addresses these constraints by giving analysts their own collection of data assets for exploration and modeling.







Question : You do a Student's t-test to compare the average test scores of sample groups from populations A
and B. Group A averaged 10 points higher than group B. You find that this difference is significant,
with a p-value of 0.03. What does that mean?
1. There is a 3% chance that you have identified a difference between the populations when in
reality there is none.
2. The difference in scores between a sample from population A and a sample from population B
will tend to be within 3% of 10 points.
3. (option partially shown) … sample group from population B.
4. There is a 97% chance that a sample group from population A will score 10 points higher than a
sample group from population B.

Correct Answer :
Explanation: P values evaluate how well the sample data support the devil's-advocate argument that the null hypothesis is true. A P value measures how compatible your data are with the null hypothesis: how likely is the effect observed in your sample data if the null hypothesis is true?
High P values: your data are likely under a true null.
Low P values: your data are unlikely under a true null.
A low P value suggests that your sample provides enough evidence to reject the null hypothesis for the entire population.

How do you interpret P values? In technical terms, a P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis. For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you'd obtain the observed difference or more in 4% of studies due to random sampling error. P values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis. This limitation leads directly to a very common misinterpretation of P values.

P values are NOT the probability of making a mistake. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error). There are several reasons why P values can't be the error rate. First, P values are calculated on the assumption that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can't tell you the probability that the null is true or false, because it is 100% true from the perspective of the calculations. Second, while a low P value indicates that your data are unlikely assuming a true null, it can't evaluate which of two competing cases is more likely:
The null is true but your sample was unusual.
The null is false.
Determining which case is more likely requires subject-area knowledge and replicate studies. Going back to the vaccine study, compare the correct and incorrect ways to interpret the P value of 0.04:
Correct: Assuming that the vaccine had no effect, you'd obtain the observed difference or more in 4% of studies due to random sampling error.
Incorrect: If you reject the null hypothesis, there's a 4% chance that you're making a mistake.

What is the true error rate? This interpretation difference is not merely a matter of semantics important only to picky statisticians; it has practical consequences. If a P value is not the error rate, what is? Sellke et al. have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions, the table below summarizes them for middle-of-the-road assumptions.
P value    Probability of incorrectly rejecting a true null hypothesis
0.05       At least 23% (and typically close to 50%)
0.01       At least 7% (and typically close to 15%)
Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is
justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. That can be costly!
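The definition above — the probability, under a true null, of an effect at least as extreme as the observed one — can be made concrete with a small permutation test: if the null hypothesis is true, group labels are interchangeable, so the P value is simply the fraction of relabelings that reproduce or exceed the observed difference. A minimal sketch (the scores are made-up illustrative data, not from any real study):

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical test scores for two sample groups (illustrative data only).
group_a = [78, 85, 90, 88, 81, 86, 92, 84]
group_b = [72, 75, 80, 74, 79, 77, 73, 76]

observed = mean(group_a) - mean(group_b)

# Permutation test: under a true null there is no difference between the
# populations, so group labels are arbitrary. Shuffle the labels and count
# how often a difference at least as extreme as the observed one appears.
pooled = group_a + group_b
n_a = len(group_a)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A P value of 0.03 computed this way would mean that 3% of random relabelings produce a difference as large as the one observed, assuming the null is true — not that there is a 3% chance the null is true.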




Question : What is one modeling or descriptive statistical function in MADlib that is typically not provided in a
standard relational database?
1. Expected value
2. Variance
3. (option not shown)

4. Quantiles


Correct Answer :

Explanation: Linear regression models a linear relationship between a scalar dependent variable y and one or more explanatory (independent) variables x, producing a model of coefficients.
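That description can be sketched with the closed-form least-squares solution for a single explanatory variable (a minimal illustration with made-up data, not MADlib's actual implementation):

```python
# Fit y = b0 + b1 * x by ordinary least squares using the closed-form
# single-variable solution: b1 = cov(x, y) / var(x), b0 = mean_y - b1 * mean_x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]  # roughly y = 2x (illustrative values)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x
print(f"fitted model: y = {b0:.2f} + {b1:.2f} * x")
```

With multiple explanatory variables the same idea generalizes to solving the normal equations over a coefficient vector, which is the form MADlib's in-database linear regression produces.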


Related Questions


Question : How does Pig's use of a schema differ from that of a traditional RDBMS?


1. Pig's schema requires that the data is physically present when the schema is defined
2. Pig's schema is required for ETL
3. (option not shown)
4. Pig's schema is optional






Question : You are provided four different datasets. Initial analysis on these datasets show that they have
identical mean, variance and correlation values. What should your next step in the analysis be?

1. Select one of the four datasets and begin planning and building a model
2. Combine the data from all four of the datasets and begin planning and building a model
3. (option not shown)
4. Visualize the data to further explore the characteristics of each data set
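This scenario alludes to Anscombe's quartet: four datasets engineered to share nearly identical summary statistics while looking completely different when plotted. A quick sketch using the published values of two of the four datasets (y1 is a noisy line, y2 a smooth parabola) shows why summary statistics alone can mislead:

```python
from statistics import mean, variance

# Two of Anscombe's four datasets (published values) over the same x values.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    # Pearson correlation coefficient, computed directly from its definition.
    ma, mb = mean(a), mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

for name, y in (("y1", y1), ("y2", y2)):
    print(f"{name}: mean={mean(y):.2f} var={variance(y):.2f} r={pearson(x, y):.3f}")
# Both datasets report mean ~7.50, variance ~4.13, r ~0.816 — yet a scatter
# plot shows y1 is noisy-linear while y2 is a smooth curve.
```

Because the summaries match, only visualization reveals that each dataset calls for a different modeling approach — which is why plotting the data is the standard next step here.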




Question : You are asked to create a model to predict the total number of monthly subscribers for a specific
magazine. You are provided with 1 year's worth of subscription and payment data, user
demographic data, and 10 years' worth of content of the magazine (articles and pictures). Which
algorithm is the most appropriate for building a predictive model for subscribers?

1. TF-IDF
2. Linear regression
3. (option not shown)
4. Decision trees




Question : Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured
data is to ____?


1. Text documents
2. XML data
3. (option not shown)
4. Image files




Question : Which statement describes a true property of the logistic regression method?
1. It handles missing values well.
2. It works well with discrete variables that have many distinct values.
3. (option not shown)
4. It works well with variables that affect the outcome in a discontinuous way.



Question : You have been assigned to do a study of the daily revenue effect of a pricing model of online
transactions. You have tested all the theoretical models in the previous model planning stage, and
all tests have yielded statistically insignificant results. What is your next step?

1. Run all the models again against a larger sample, leveraging more historical data.
2. Report that the results are insignificant, and reevaluate the original business question.
3. (option not shown)
4. Modify samples used by the models and iterate until a significant result occurs.