Dell EMC Data Science and BigData Certification Questions and Answers

Question : You are working as a data scientist in a retail chain company. You and your team have been given a project to implement recommendation engines for the products the company sells online, and you have decided to create an analytics sandbox. Which of the following are you trying to achieve?

1. You are creating a Hive table in Hadoop Framework.

2. You are defining the SQL queries for extracting the data.

3. You are estimating the size of the datasets and planning for a total of 5 to 10 times the storage size of the data.

4. You would be transforming your semi-structured data into well-formatted data and saving it into a CSV file.

5. You are selecting the Advanced Analytics model.

Correct Answer : 3
Explanation: The terms sandbox and workspace can be used interchangeably. You explore the datasets while designing the sandbox. It holds a copy of the data that is separate from your production data, so that you can run in-memory analytics on it without affecting your production load. Generally, a sandbox contains raw data, aggregated data, or data in various formats. You would plan for 5 to 10 times the size of the data for the analytics sandbox, because you will be creating various copies of the data.
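
The 5 to 10 times sizing rule above can be sketched as simple arithmetic. This is only an illustration: the function name and the 2 TB raw data size are hypothetical and do not come from the question.

```python
def sandbox_storage_range(raw_size_tb, low_factor=5, high_factor=10):
    """Return the (low, high) storage estimate for an analytics sandbox.

    The default factors follow the 5x to 10x rule of thumb from the explanation.
    """
    if raw_size_tb < 0:
        raise ValueError("raw_size_tb must be non-negative")
    return raw_size_tb * low_factor, raw_size_tb * high_factor


# Hypothetical example: 2 TB of raw data.
low, high = sandbox_storage_range(2.0)
print(f"Plan for roughly {low:.0f} to {high:.0f} TB of sandbox storage")
```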

Question : You are working with a training company which provides online training for various professions. You have received data for further analysis that is already transformed and structured. You find that there is a high correlation between course category, course watched and number of hours of training watched. You need to use some technique to handle these highly correlated variables. Which of the below will you use?

1. You will take the square root of each variable, so that the correlation can be removed.

2. You will discard all three of these variables.

3. You would use a normalizing technique so that the three variables become equal in size.

4. You will create a new variable which is a function of these three correlated variables.

Correct Answer : 4
Explanation: Handling correlated variables: one option is to keep a single variable, choosing the one that is more logically connected to what you are trying to predict, or else the one that correlates most strongly with the outcome variable. The other option is to create a new variable by combining them. If you go this route, consider converting the variables onto a similar scale (for example, a z-score) and then summing them. This may give you a more valid measure of the underlying construct. For example, age and experience may both be related to judgment.
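
The z-score-then-sum approach described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names and the age and experience values are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a numeric column to mean 0 and (population) std dev 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


def composite(*columns):
    """Combine correlated columns by summing their z-scores row by row."""
    standardized = [z_scores(col) for col in columns]
    return [sum(row) for row in zip(*standardized)]


# Hypothetical data: two correlated variables on very different scales.
age = [25, 35, 45, 55]
experience = [2, 10, 21, 30]
judgment_proxy = composite(age, experience)
print([round(v, 2) for v in judgment_proxy])
```

Standardizing first keeps the variable with the larger raw scale (here, age) from dominating the sum.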

Question : You are doing advanced analytics for a medical application using regression. You have two input variables, weight and height, which are very important and cannot be ignored, and they are also highly correlated. What is the best solution?

1. You will take the cube root of the height.

2. You will take the square root of the weight.

3. You will take the square of the height.

4. You would consider using BMI (Body Mass Index).

Correct Answer : 4
Explanation: If multiple variables are highly correlated, it is better either to use only the one variable that correlates more strongly with the outcome (which is not among the given options) or to create a new variable which is a function of both variables. In this case that could be BMI (Body Mass Index), because it is a function of both weight and height as per the formula below.
BMI = Weight/(Height * Height)
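
The formula above can be sketched in code as a single derived feature replacing the two correlated inputs. The function name is hypothetical, and the units (kilograms and metres, the usual BMI convention) are an assumption, since the formula above does not state them.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared.

    Assumes weight in kilograms and height in metres.
    """
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / (height_m * height_m)


# Hypothetical patient records: (weight in kg, height in m).
patients = [(70.0, 1.75), (85.0, 1.80), (54.0, 1.60)]
bmi_feature = [bmi(w, h) for w, h in patients]
print([round(v, 1) for v in bmi_feature])
```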

Related Questions

Question : You have been given a huge dataset with the following occurrences:
Bread appears in 80% of all transactions, and the combination of bread and milk appears in 60% of all transactions. Which of the following statements is correct with regard to Apriori?

1. Support for {bread} is 0.8

2. Support for {bread} is 0.6

3. Support for {bread} is 1.4

4. Support for {bread} is 0.2
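
As a point of reference, the support of an itemset is the fraction of all transactions that contain every item in it. A minimal sketch, assuming a hypothetical five-transaction toy dataset constructed to mirror the percentages in the question:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical toy data: bread in 4 of 5 baskets, bread and milk together in 3 of 5.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"eggs"},
]
print(support({"bread"}, transactions))
print(support({"bread", "milk"}, transactions))
```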

Question : For the Apriori algorithm, you have decided that the minimum support value is ., which of the following are frequent itemsets, if the following percentage occurrences are given?
Bread->80%
Milk->70%
Bread,Milk -> 55%
Bread, Banana -> 30%
A. Bread
B. Milk
C. Bread, Milk
D. Banana
E. Bread, Banana

1. A,B,C
2. B,C,D
3. C,D,E
4. A,D,E
5. A,C,E

Question : You have been given that the combination of three items {a,b,c} has . support and minimum support is defined as .. Which of the following statements is correct?
A. The combination {a,b} is a frequent itemset
B. The combination {b,c} is a frequent itemset
C. The combination {a,c} is a frequent itemset
D. Item {a} is a frequent itemset
E. Item {c} is a frequent itemset

1. A,B,C
2. B,C,D
3. B,C,D,E
4. A,B,C,D
5. A,B,C,D,E
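
The property this kind of question relies on is that any transaction containing {a,b,c} also contains each of its subsets, so no subset can have lower support than the full itemset. A minimal sketch that checks this on hypothetical toy transactions (the data and helper names are illustrative and not taken from the question):

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def proper_subsets(itemset):
    """All non-empty proper subsets of `itemset`."""
    items = sorted(itemset)
    return [frozenset(c) for r in range(1, len(items)) for c in combinations(items, r)]


# Hypothetical toy transactions.
toy_transactions = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a"},
)]

full = frozenset({"a", "b", "c"})
full_support = support(full, toy_transactions)
for subset in proper_subsets(full):
    # A subset is contained in at least every transaction that contains the full set.
    assert support(subset, toy_transactions) >= full_support
    print(sorted(subset), support(subset, toy_transactions))
```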

Question : Which of the following is true with regard to the Apriori algorithm?
A. The algorithm starts with the combination of all the distinct items to find the frequent itemset, and in each next iteration it removes one item from that frequent itemset.
B. The algorithm starts with one distinct item to find the frequent itemsets, and in each next iteration it adds one item to find the frequent itemsets.
C. If a combination is a frequent itemset, then its subsets will also be frequent itemsets.
D. If a combination is a frequent itemset, it does not guarantee that the subsets of that combination will also be frequent itemsets.

1. A,B
2. B,C
3. C,D
4. A,D
5. B,D
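
For orientation, the level-wise search that Apriori performs can be sketched as below: start from single items, keep those meeting the minimum support, then grow candidates by one item per pass and prune any candidate that has an infrequent subset. This is a simplified teaching sketch rather than an optimized implementation; the function name, the toy transactions and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy transactions and threshold.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "banana"},
    {"bread", "banana"},
    {"bread", "milk"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```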

Question : If you have an association rule X->Y, which of the below represents the confidence?

1. Support for {X} / Support for {X,Y}

2. Support for {X,Y} / Support for {X}

4. Support for {Y} / Support for {X}
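
By the standard definition, the confidence of a rule X->Y is the support of X and Y together divided by the support of X alone, i.e. the share of transactions containing X that also contain Y. A minimal sketch, reusing the bread (80%) and bread-and-milk (60%) figures from the earlier question; the function name is hypothetical:

```python
def confidence(support_xy, support_x):
    """Confidence of the rule X -> Y from the two supports."""
    if support_x <= 0:
        raise ValueError("support_x must be positive")
    return support_xy / support_x


# Rule bread -> milk: support {bread, milk} = 0.6, support {bread} = 0.8.
bread_to_milk = confidence(0.6, 0.8)
print(f"confidence(bread -> milk) = {bread_to_milk:.2f}")
```

With these figures the value comes out at about 0.75: roughly three in four baskets containing bread also contain milk.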

Question : In the Apriori algorithm, which statements are true with regard to the confidence of the association rule {X->Y}?
A. It considers the antecedent {X}
B. It considers the consequent {Y}
C. It considers the co-occurrence of {X,Y}
D. It does not consider the consequent {Y}
E. Confidence cannot tell whether a rule contains a true implication of the relationship or whether the rule is purely coincidental.

1. A,B,C
2. B,C,D
4. A,C,D,E
5. A,B,C,D,E
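
To illustrate why confidence alone may not separate a meaningful rule from a coincidental one, the sketch below compares it with lift. Lift is not mentioned in the text above; it is shown here only as one commonly used complementary measure, defined as support {X,Y} / (support {X} * support {Y}). The supports are hypothetical and chosen so that X and Y are statistically independent.

```python
def confidence(support_xy, support_x):
    """Confidence of X -> Y: uses only the supports of {X,Y} and {X}."""
    return support_xy / support_x


def lift(support_xy, support_x, support_y):
    """Lift of X -> Y: also accounts for how common Y is on its own."""
    return support_xy / (support_x * support_y)


# Hypothetical supports: Y is in 90% of baskets regardless of X.
s_x, s_y, s_xy = 0.5, 0.9, 0.45
rule_confidence = confidence(s_xy, s_x)
rule_lift = lift(s_xy, s_x, s_y)
print(f"confidence = {rule_confidence:.2f}, lift = {rule_lift:.2f}")
```

Here the confidence is high (about 0.9) only because Y is common everywhere, while the lift of about 1.0 suggests X tells you nothing extra about Y.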