Dell EMC Data Science and BigData Certification Questions and Answers

Question : You are working on a clustering solution for a customer dataset. A large number of variables are available for each customer, and data is available for a large number of customers. You want to reduce
the number of variables used for clustering. What would you do?
A. Randomly reduce the number of variables.
B. Find the correlation among the variables and discard the variables that are not correlated.
C. Find the correlation among the variables and, from each group of highly correlated variables, keep only one or two.
D. You cannot discard any variable when creating clusters.
E. Combine several variables into one variable.

1. A,B
2. B,D
3. C,D
4. C,E
5. A,E

Correct Answer : 4
Explanation: When you apply a clustering technique to data with a very large number of variables, it is better to find the correlation among the variables and
keep only one or two variables from each group of highly correlated variables, because highly correlated variables have nearly the same effect when the clusters are created. A scatter plot matrix of the variables can be used to
find the correlations.
You can also combine several variables into a single variable. For example, if the dataset contains Asset and Debt values, you can combine them into a Debt-to-Asset ratio and use that ratio when
creating the clusters.
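
The two ideas in this explanation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer values, the column names, the 0.9 threshold, and the helper names pearson and drop_correlated are all made up for this example and do not come from the question.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical customer data: spend rises almost linearly with income.
customers = {
    "income": [30, 45, 60, 75, 90, 120],
    "spend": [10, 14, 21, 24, 31, 39],
    "age": [52, 23, 41, 35, 64, 29],
}
asset = [100, 250, 80, 400, 150, 300]   # hypothetical values
debt = [20, 100, 60, 40, 75, 30]        # hypothetical values

features = drop_correlated(customers)   # "spend" is dropped, "income" stands in for it
# Combine two variables into one, as in the Debt-to-Asset example.
features["debt_to_asset"] = [d / a for d, a in zip(debt, asset)]

print(list(features))
```

Which variable of a correlated group is kept here depends only on column order; in practice that choice would normally be made with domain knowledge.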

Question : You have patient data with height and age, where age is in years and height is in meters. You want to create clusters using these two attributes, and you want age and height to have
a nearly equal effect when the clusters are created. What can you do?
A. Add the numeric value 100 to each height
B. Convert each height value to centimeters
C. Divide both age and height by their respective standard deviations
D. Take the square root of height

1. A,B
2. B,C
3. C,D
4. A,D
5. B,D

Correct Answer : 2
Explanation: Age in years takes values such as 50, 60, 70, or 90. When the distance from a centroid is calculated, the largest possible age difference is 90 - 0 = 90, and its square is 8100.
Height in meters might range from 0.5 to 2, so the largest difference is 2 - 0.5 = 1.5 meters, and its square is only 2.25. Age therefore has far more effect on the distance than height.
To bring height to a comparable level, you can convert it to centimeters: the values then go up to about 200, so the squared differences are bounded by the square of 200 and are no longer negligible next to those for age.
Another approach is to divide each value by the standard deviation of its attribute, for example age / (standard deviation of age). The result has no unit, so it does not depend on the unit in which
the attribute was measured.
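
As a small check of the standard deviation approach, the sketch below scales the same heights expressed in meters and in centimeters and shows that the scaled values agree. The ages are the ones mentioned in the explanation; the heights and the helper name scale_by_sd are assumptions made for this example.

```python
from statistics import pstdev

age_years = [50, 60, 70, 90]        # ages mentioned in the explanation
height_m = [1.5, 1.6, 1.7, 1.9]     # hypothetical heights in meters
height_cm = [h * 100 for h in height_m]

def scale_by_sd(values):
    """Divide each value by the (population) standard deviation of its attribute."""
    sd = pstdev(values)
    return [v / sd for v in values]

scaled_age = scale_by_sd(age_years)
scaled_m = scale_by_sd(height_m)
scaled_cm = scale_by_sd(height_cm)

# After scaling, the unit of height no longer matters, and both attributes
# have a standard deviation of 1, so neither dominates the distance.
print(scaled_m)
print(scaled_cm)
print(pstdev(scaled_age), pstdev(scaled_m))
```

The sample standard deviation (statistics.stdev) would work equally well here; the point is only that the same divisor is used for every value of an attribute.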

Question : Which of the following are true with regard to the K-Means clustering algorithm?
A. Labels are not pre-assigned to the objects being clustered.
B. Labels are pre-assigned to the objects being clustered.
C. It classifies the data based on the labels.
D. It discovers the center of each cluster.
E. It finds which cluster each object falls into.

1. A,B,C
2. B,C,D
3. C,D,E
4. A,D,E
5. A,C,E

Correct Answer : 4
Explanation: Clustering does not require any predefined labels on the objects; it considers only the attributes of the objects. Hence, option B is out. Clustering is also different from classification,
so you can discard option C as well. Because it does not use predefined labels, clustering is called unsupervised learning, and option A is correct.
The main purpose of the clustering technique is to determine the center of each cluster and then measure the distance of each object from those centers. An object falls into the cluster whose center it is nearest to. In the end,
you have the groups, or clusters, and you know which cluster each object falls into.
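
The loop described in this explanation (assign each object to the nearest center, then recompute the centers) can be sketched as follows. This is a minimal illustration, not a production implementation: the data points are made up, the function name kmeans is hypothetical, and using the first k points as starting centers is a simplifying assumption.

```python
from math import dist

def kmeans(points, k, n_iter=100):
    """Minimal k-means: no labels are needed, only the attributes of each point."""
    centers = [tuple(p) for p in points[:k]]   # assumption: first k points as initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster with the nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center unchanged
        if new_centers == centers:              # stop when the centers no longer move
            break
        centers = new_centers
    return centers, labels

# Hypothetical two-dimensional data forming two obvious groups.
points = [(1.2, 0.8), (7.8, 8.2), (1.0, 1.0), (8.0, 8.0), (0.8, 1.2), (8.2, 7.8)]
centers, labels = kmeans(points, k=2)
print(centers)   # discovered cluster centers (option D)
print(labels)    # cluster index of each object (option E)
```

The cluster indices returned here are arbitrary identifiers produced by the algorithm, not labels supplied with the data.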

Related Questions

Question : On analyzing your time series data, you suspect that the data represented as
y_1, y_2, y_3, ..., y_{n-1}, y_n
may have a trend component that is quadratic in nature. Which pattern of data will indicate that
the trend in the time series data is quadratic in nature?

1. (y_4 - y_2) - (y_3 - y_1) = ... = (y_n - y_{n-2}) - (y_{n-1} - y_{n-3})

2. ((y_2 - y_1) / y_1) * 100% = ... = ((y_n - y_{n-1}) / y_{n-1}) * 100%

4. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2})
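
For experimenting with patterns like these, the sketch below builds a series that follows an exact quadratic trend and computes its first differences, its percentage changes, and the differences of consecutive first differences. The coefficients 2, 3, and 5 are arbitrary values chosen for this example. For an exact quadratic, the differences of consecutive first differences are constant, while the first differences and the percentage changes are not.

```python
# Hypothetical series with an exact quadratic trend: y_t = 2*t^2 + 3*t + 5
y = [2 * t * t + 3 * t + 5 for t in range(1, 11)]

# First differences: y_t - y_{t-1}
first_diff = [b - a for a, b in zip(y, y[1:])]

# Differences of consecutive first differences:
# (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})
second_diff = [b - a for a, b in zip(first_diff, first_diff[1:])]

# Percentage changes: ((y_t - y_{t-1}) / y_{t-1}) * 100
pct_change = [100 * (b - a) / a for a, b in zip(y, y[1:])]

print(first_diff)
print(second_diff)
print(pct_change)
```

Real data with noise will not give exactly equal values, so in practice one would look for differences that are roughly constant.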

Question : Which analytical method is considered unsupervised?

1. Naive Bayesian classifier

2. Decision tree
4. K-means clustering

Question : You have used k-means clustering to classify the behavior of customers for a retail store.
You decide to use household income, age, gender and yearly purchase amount as measures. You
have chosen to use 8 clusters and notice that 2 clusters have only 3 customers assigned. What
should you do?

1. Decrease the number of measures used
2. Increase the number of clusters
4. Identify additional measures to add to the analysis

Question : What does the R code nv <- v[v < 1000] do?

1. Selects the values in vector v that are less than 1000 and assigns them to the vector nv
2. Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000
4. Selects values of vector v less than 1000, modifies v, and makes a copy to nv
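
In R, a comparison such as v < 1000 (the threshold named in the options) produces a logical vector, and indexing v with that logical vector keeps the elements where it is TRUE. A rough Python analogue of the same two steps is sketched below; the values in v are made up for this example.

```python
v = [250, 1000, 1500, 999, 42]      # hypothetical values

# Step 1: element-wise comparison, one True/False per element (like v < 1000 in R).
mask = [x < 1000 for x in v]

# Step 2: keep the elements where the mask is True (like v[mask] in R).
nv = [x for x, keep in zip(v, mask) if keep]

print(mask)
print(nv)
print(v)    # the original sequence is left unchanged
```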

Question : For which class of problem is MapReduce most suitable?

1. Minimal result data
2. Simple marginalization tasks
4. Non-overlapping queries

Question : Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?

1. Define the process to maintain the model
2. Try different analytical techniques
4. Transform existing variables