
Dell EMC Data Science and BigData Certification Questions and Answers



Question : You have been given two training datasets, HEPop1 and HEPop2, for different courses in the same subject. You need to find out whether both populations are the same. Hence, you decide to use hypothesis
testing with the t-test. You assume that both populations have the same distribution and that their variance is not known. To conduct the t-test you need to calculate the t-statistic. Which of the following would you require
to calculate the t-statistic?
A. Sample Mean
B. Sample Size
C. Standard Deviation of Sample
D. Population size
E. Mean of each population
F. Average of the sample mean values

1. A,B,C
2. C,D,E
3. D,E,F
4. A,B,E
5. A,C,F

Correct Answer : 1
Explanation: When you conduct hypothesis testing using the t-test, you use samples and not the actual population, because the population is generally quite big and you cannot run the test on the
whole population. Hence, options D and E are out.
If each population is normally distributed with the same mean (u1=u2) and the same variance, then the t-statistic follows a t-distribution with (n1+n2-2) degrees of freedom, i.e. (sample size
from HEPop1 + sample size from HEPop2 - 2), whose shape is similar to the normal distribution. Once the degrees of freedom reach about 30 or more, the t-distribution is nearly identical to the normal distribution.
If the absolute value of the t-statistic is large enough, you can reject the null hypothesis (u1=u2).
One of the key assumptions made in Student's t-test is that the variances of both populations are equal.
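
To make the role of options A, B and C concrete, here is a minimal Python sketch of the pooled (equal-variance) two-sample t-statistic. The function name and the score lists are hypothetical and only serve as an illustration. They are not part of the original question.

```python
import math
import statistics

def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t-statistic using a pooled variance estimate.

    Only the sample means, sample sizes and sample variances
    (i.e. squared sample standard deviations) are needed.
    """
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

    df = n1 + n2 - 2  # degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical scores drawn from the two courses (illustration only).
hepop1_scores = [72, 75, 78, 81, 84]
hepop2_scores = [70, 74, 76, 80, 85]

t_stat, dof = pooled_t_statistic(hepop1_scores, hepop2_scores)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The resulting t value would then be compared against the t-distribution with the returned degrees of freedom to decide whether to reject the null hypothesis.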





Question : You are working in a healthcare company in America and want to measure people's weights. You cannot measure the weight of every person in the entire population. Which of the following formulas would be
helpful for this requirement?
1. A
2. B
3. C
4. D

Correct Answer : 4
Explanation: To estimate the spread of weights in the population, you can use the sample variance, because it is not possible to take the weight of the entire population. Variance is the average of the squared differences from the
mean.
However, variance on its own is not that useful for drawing conclusions about the weights in the population, because the values are squared. Hence, you have to take the square root of the variance, and that value is called the standard deviation.
Suppose your standard deviation is 80; then the variance will be 6400. Suppose the mean of the sample is 140. Then you can conclude that the majority of people's weights will lie between (mean - sd) and (mean + sd). In this case it would
be (140-80) to (140+80) = 60 to 220.
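
As a rough illustration of the calculation described above, the following Python sketch computes a sample variance and standard deviation. It assumes the usual sample variance with an (n - 1) denominator, since the formulas behind options A to D are not shown in the source. The weights are made-up values.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical sample of weights (illustration only).
weights = [120, 135, 140, 150, 155]

mean_weight = sum(weights) / len(weights)
variance = sample_variance(weights)
std_dev = math.sqrt(variance)  # square root brings the unit back to the original scale

print(f"mean = {mean_weight:.1f}, variance = {variance:.1f}, sd = {std_dev:.1f}")
print(f"typical range: {mean_weight - std_dev:.1f} to {mean_weight + std_dev:.1f}")
```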





Question : Which of the following hypothesis tests would be more reliable when you have unequal variances and unequal sample sizes?


1. Student t-tests

2. Welch t-test

3. Logistic regression

4. Linear regression

5. Wilcoxon rank-sum test


Correct Answer : 2
Explanation: Logistic regression and linear regression are not hypothesis tests, so you can discard these two options. The remaining three options are actual hypothesis tests. Here is how Wikipedia describes
Welch's t-test:
In statistics, Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Welch's t-test is an adaptation of Student's
t-test, and is more reliable when the two samples have unequal variances and unequal sample sizes. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when
the statistical units underlying the two samples being compared are non-overlapping. Given that Welch's t-test has been less popular than Student's t-test [2] and may be less familiar to readers, a more informative
name is "Welch's unequal variances t-test" or "unequal variances t-test" for brevity.
Student's t-test is used when we assume that the two populations have equal variances.
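
For comparison with Student's t-test, here is a minimal Python sketch of Welch's t-statistic together with the Welch-Satterthwaite approximation for the degrees of freedom. The function name and sample data are hypothetical. This is a sketch of the standard formulas, not a replacement for a statistics library.

```python
import math
import statistics

def welch_t_statistic(sample1, sample2):
    """Welch's unequal-variances t-statistic and approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)

    # Each sample keeps its own variance; nothing is pooled.
    se1_sq = statistics.variance(sample1) / n1
    se2_sq = statistics.variance(sample2) / n2

    t = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)

    # Welch-Satterthwaite approximation (usually not an integer).
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical samples with different sizes and spreads (illustration only).
group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
group_b = [12.5, 8.0, 11.0, 14.2]

t_stat, dof = welch_t_statistic(group_a, group_b)
print(f"t = {t_stat:.3f}, approximate degrees of freedom = {dof:.2f}")
```

When both samples have the same size and the same variance, this reduces to the same t value and degrees of freedom as the pooled Student's t-test.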



Related Questions


Question : You have run the association rules algorithm on your data set, and the two rules {banana, apple}
=> {grape} and {apple, orange}=> {grape} have been found to be relevant. What else must be true?


1. {grape, apple, orange} must be a frequent itemset.
2. {banana, apple, grape, orange} must be a frequent itemset.
4. {banana, apple} => {orange} must be a relevant rule.


similar interests. For example, association rules may suggest that those customers who have bought product A have also bought product B, or those customers who have bought products A, B, and C are more similar to this
customer. These findings provide opportunities for retailers to cross-sell their products. Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is
sometimes referred to as "Market Basket Analysis", since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect
from a random sampling of all possibilities. The classic example of this is the famous Beer and Diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to
buy diapers will also tend to buy beer at the same time. Let us illustrate this with a simple example. Suppose that a store's retail transactions database includes the following information:

There are 600,000 transactions in total.
7,500 transactions contain diapers (1.25 percent)
60,000 transactions contain beer (10 percent)
6,000 transactions contain both diapers and beer (1.0 percent)
If there was no association between beer and diapers (i.e., they are statistically independent), then we expect only 10% of diaper purchasers to also buy beer (since 10% of all customers buy beer). However, we
discover that 80% (=6000/7500) of diaper purchasers also buy beer. This is a factor of 8 increase over what was expected - that is called Lift, which is the ratio of the observed frequency of co-occurrence to the
expected frequency. This was determined simply by counting the transactions in the database. So, in this case, the association rule would state that diaper purchasers will also buy beer with a Lift factor of 8. In
statistics, Lift is simply estimated by the ratio of the joint probability of two items x and y, divided by the product of their individual probabilities: Lift = P(x,y)/[P(x)P(y)]. If the two items are statistically
independent, then P(x,y)=P(x)P(y), corresponding to Lift = 1 in that case. Note that anti-correlation yields Lift values less than 1, which is also an interesting discovery - corresponding to mutually exclusive items
that rarely co-occur together.
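
The Lift calculation above can be reproduced with a few lines of Python. This is a minimal sketch using the transaction counts from the beer and diapers example. The function and parameter names are hypothetical.

```python
def lift(total_transactions, count_x, count_y, count_both):
    """Lift = P(x, y) / (P(x) * P(y)), estimated from transaction counts."""
    p_x = count_x / total_transactions
    p_y = count_y / total_transactions
    p_xy = count_both / total_transactions
    return p_xy / (p_x * p_y)

# Counts taken from the example: 600,000 transactions in total,
# 7,500 with diapers, 60,000 with beer, 6,000 with both.
value = lift(600_000, 7_500, 60_000, 6_000)
print(f"Lift = {value:.2f}")  # prints "Lift = 8.00"
```

A value of 1 would indicate statistical independence, and a value below 1 would indicate items that tend not to be bought together.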



Question : When would you use a Wilcoxon Rank Sum test?
1. When the data can easily be sorted
2. When the populations represent the sums of other values
4. When you cannot make an assumption about the distribution of the populations


Question : In the MapReduce framework, what is the purpose of the Reduce function?

1. It writes the output of the Map function to storage
2. It breaks the input into smaller components and distributes to other nodes in the cluster
4. It distributes the input to multiple nodes for processing



Question : Which of the following is an example of quasi-structured data?
1. OLAP
2. Customer record table
4. OLTP




Question : A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse
contains data collected from many sources and transformed through a complex, multi-stage ETL
process. What is a concern the data scientist should have about the data?


1. It is too processed
2. It is not structured
4. It is too centralized






Question : Which word or phrase completes the statement? Emphasis color is to standard color as _______ .


1. Main message is to key findings
2. Frequent item set is to item
4. Pie chart is to proportions




Question : Which data asset is an example of semi-structured data?

1. XML data file
2. Database table
4. News article