
Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : A data scientist is asked to implement an article recommendation feature for an online magazine.
The magazine does not want to use client-tracking technologies such as cookies or reading
history. Therefore, only the style and subject matter of the current article are available for making
recommendations. All of the magazine's articles are stored in a database in a format suitable for
analytics.
Which method should the data scientist try first?
1. K Means Clustering
2. Naive Bayesian
3. Logistic Regression
4. Association Rules



Correct Answer: 1
Explanation: k-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until this sum cannot be decreased further. The result is a set of clusters that are as compact and well separated as possible. The details of the minimization can be controlled through several optional inputs to k-means, including the initial values of the cluster centroids and the maximum number of iterations.
Clustering is primarily an exploratory technique for discovering hidden structure in the data, often as a prelude to more focused analysis or decision processes. Specific applications of k-means include image processing, medical applications, and customer segmentation. Clustering is often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to identify customers who have similar behaviors and spending patterns.
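The assign-and-update loop described above can be illustrated with a minimal pure-NumPy sketch of Lloyd's k-means. The function, sample points, and seed are invented for illustration, not taken from the question:

```python
import numpy as np

# Minimal sketch of Lloyd's k-means: assign each point to its nearest
# centroid, recompute centroids as cluster means, and repeat until the
# centroids stop moving.
def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids at k distinct data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # distance from every point to every centroid (n x k matrix)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

With well-separated data like this, the two recovered clusters match the two original groups regardless of which points seed the centroids.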





Question : How are window functions different from regular aggregate functions?
1. Rows retain their separate identities and the window function can access more than the current row.
2. Rows are grouped into an output row and the window function can access more than the current row.
3. Rows retain their separate identities and the window function can only access the current row.
4. Rows are grouped into an output row and the window function can only access the current row.


Correct Answer: 1
Explanation: A window function enables aggregation to occur while still providing the entire dataset alongside the summary results. For example, the RANK() function can be used to order a set of rows based on some attribute.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row: the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
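The contrast can be demonstrated with Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for window-function support; the table and values are invented for illustration):

```python
import sqlite3

# Compare an aggregate (GROUP BY collapses rows) with a window function
# (RANK() OVER ... keeps every row and adds a per-partition rank).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (dept TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("a", 100), ("a", 300), ("b", 200), ("b", 500)])

# Aggregate: rows are grouped into one output row per department
agg = con.execute("SELECT dept, SUM(amount) FROM sales GROUP BY dept").fetchall()

# Window function: each row retains its identity and gains a rank
# computed over the other rows in its department
win = con.execute(
    "SELECT dept, amount, RANK() OVER "
    "(PARTITION BY dept ORDER BY amount DESC) FROM sales"
).fetchall()

print(len(agg), len(win))
```

The aggregate query returns two rows (one per group), while the window query returns all four original rows, each annotated with its rank.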







Question : Consider these item sets:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)
What is the confidence of the rule (hat, scarf) -> gloves?

1. 66%
2. 40%
3. 50%
4. 60%

Correct Answer: 1
Explanation: Confidence measures the chance that X and Y appear together relative to the chance that X appears, and can be used to gauge the interestingness of a rule. Of the three item sets that contain (hat, scarf), two also contain gloves:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
2/3 ≈ 66%
Antecedent   Consequent
A            0
A            0
A            1
A            0
B            1
B            0
B            1
where the antecedent is the input variable that we can control, and the consequent is the variable we are trying to predict. Real mining problems would typically have more complex antecedents, but usually focus on single-value consequents. Most mining algorithms would determine the following rules (targeting models):
Rule 1: A implies 0
Rule 2: B implies 1
because these are simply the most common patterns found in the data. A simple review of the above table should make these rules obvious. The confidence for Rule 1 is 3/4 because three of the four records that meet the antecedent of A meet the consequent of 0. The confidence for Rule 2 is 2/3 because two of the three records that meet the antecedent of B meet the consequent of 1.
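The confidence calculations above can be sketched in a few lines of Python; the helper name and the set-based basket representation are illustrative assumptions:

```python
# confidence(X -> Y) = support(X and Y together) / support(X),
# computed over the five item sets from the question.
baskets = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]

def confidence(antecedent, consequent, baskets):
    # baskets containing the whole antecedent
    has_ante = [b for b in baskets if antecedent <= b]
    # of those, baskets that also contain the consequent
    has_both = [b for b in has_ante if consequent <= b]
    return len(has_both) / len(has_ante)

conf = confidence({"hat", "scarf"}, {"gloves"}, baskets)
print(round(conf, 2))  # 2 of the 3 (hat, scarf) baskets also contain gloves
```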




Related Questions


Question : In the MapReduce framework, what is the purpose of the Map Function?
1. It processes the input and generates key-value pairs
2. It collects the output of the Reduce function
3. It sorts the results of the Reduce function
4. It breaks the input into smaller components and distributes to other nodes in the cluster



Question : You have completed your model and are handing it off to be deployed in production. What should
you deliver to the production team, along with your commented code?
1. The production team needs to understand how your model will interact with the processes they
already support. Give them documentation on expected model inputs and outputs, and guidance
on error-handling.
2. The production team are technical, and they need to understand how the processes that they
support work, so give them the same presentation that you prepared for the analysts.
3. The production team supports the processes that run the organization, and they need context
to understand how your model interacts with the processes they already support. Give them the
same presentation that you prepared for the project sponsor.
4. The production team supports the processes that run the organization, and they need context
to understand how your model interacts with the processes they already support. Give them the
executive summary.



Question : While having a discussion with your colleague, this person mentions that they want to perform k-means
clustering on text file data stored in HDFS.
Which tool would you recommend to this colleague?

1. Sqoop
2. Scribe
3. HBase
4. Mahout



Question : Which method is used to solve for the coefficients b0, b1, ..., bn in your linear regression model:
Y = b0 + b1x1 + b2x2 + ... + bnxn

1. Apriori Algorithm
2. Ridge and Lasso
3. Ordinary Least Squares
4. Integer programming




Question : What describes a true limitation of the logistic regression method?

1. It does not handle redundant variables well.
2. It does not handle missing values well.
3. It does not handle correlated variables well.
4. It does not have explanatory values.





Question : You submit a MapReduce job to a Hadoop cluster and notice that although the job was
successfully submitted, it is not completing. What should you do?
1. Ensure that the NameNode is running
2. Ensure that the JobTracker is running
3. Ensure that the TaskTracker is running.
4. Ensure that a DataNode is running