Dell EMC Data Science and BigData Certification Questions and Answers

Question : A data scientist is asked to implement an article recommendation feature for an on-line magazine.
The magazine does not want to use client tracking technologies such as cookies or reading
history. Therefore, only the style and subject matter of the current article are available for making
recommendations. All of the magazine's articles are stored in a database in a format suitable for
analytics.
Which method should the data scientist try first?

1. K Means Clustering
2. Naive Bayesian
4. Association Rules

Correct Answer : 1
Explanation: K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including the initial values of the cluster centroids and the maximum number of iterations.
Clustering is primarily an exploratory technique for discovering hidden structure in data, possibly as a prelude to more focused analysis or decision processes. Typical application areas of k-means include image processing, medicine, and customer segmentation. Clustering is often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
In this question no reading history and no labels are available, so an unsupervised method fits: the articles can be clustered by style and subject matter, and other articles from the same cluster as the current article can be recommended.
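The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the standard assign-and-update loop, not the kmeans routine the explanation refers to. The two-dimensional feature vectors, the starting centroids, and the helper names `kmeans` and `recommend` are all hypothetical, chosen only to show how articles that land in the same cluster could be offered as recommendations.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate an assignment step and a centroid update until nothing moves."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

def recommend(current, labels):
    """Return the indices of the other articles in the current article's cluster."""
    return [i for i, label in enumerate(labels)
            if label == labels[current] and i != current]

# Hypothetical (style score, subject score) features for six articles.
articles = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
            (0.1, 0.9), (0.2, 0.8), (0.15, 0.95)]
labels, centers = kmeans(articles, centroids=[articles[0], articles[3]])
suggestions = recommend(0, labels)  # articles clustered with article 0
```

Passing the initial centroids explicitly keeps the sketch deterministic; practical implementations usually choose them randomly and run several restarts.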

Question : How are window functions different from regular aggregate functions?

1. Rows retain their separate identities and the window function can access more than the current row.
2. Rows are grouped into an output row and the window function can access more than the current row.
4. Rows are grouped into an output row and the window function can only access the current row.

Correct Answer : 1
Explanation: A window function allows aggregation to occur while still returning the entire dataset alongside the summary results. For example, the RANK() function can be used to order a set of rows based on some attribute.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. Unlike regular aggregate functions, however, a window function does not cause rows to become grouped into a single output row: the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
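The contrast can be seen by running both kinds of query against the same rows. The sketch below uses Python's built-in sqlite3 module and assumes the bundled SQLite library is version 3.25 or later, which is when SQLite gained window functions. The `sales` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "ann", 300), ("east", "bob", 200),
     ("west", "cy", 500), ("west", "dee", 100)],
)

# Regular aggregate: each region collapses into a single output row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window functions: every row is kept, and each row is computed
# with access to all rows in its partition.
windowed = conn.execute(
    """
    SELECT rep, region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM sales
    ORDER BY region, rank_in_region
    """
).fetchall()
```

Here `grouped` holds one row per region, while `windowed` still holds all four original rows, each carrying its region total and its rank within the region.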

Question : Consider these item sets:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)
What is the confidence of the rule (hat, scarf) -> gloves?

1. 66%
2. 40%
4. 60%

Correct Answer : 1
Explanation: Confidence measures the chance that X and Y appear together relative to the chance that X appears: confidence(X -> Y) = support(X and Y) / support(X). Confidence can be used to assess how interesting a rule is. Three of the five item sets contain both hat and scarf:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
Two of these three also contain gloves, so the confidence is 2/3, or about 66.7%, which matches the 66% option.
As a second illustration, consider the following records:
Antecedent  Consequent
A           0
A           0
A           1
A           0
B           1
B           0
B           1
Here the antecedent is the input variable that we can control, and the consequent is the variable we are trying to predict. Real mining problems typically have more complex antecedents but usually focus on single-value consequents. Most mining algorithms would determine the following rules (targeting models):
Rule 1: A implies 0
Rule 2: B implies 1
because these are simply the most common patterns found in the data. A simple review of the table above should make these rules obvious. The confidence for Rule 1 is 3/4 because three of the four records that meet the antecedent of A meet the consequent of 0. The confidence for Rule 2 is 2/3 because two of the three records that meet the antecedent of B meet the consequent of 1.

Related Questions

Question : You are using the Apriori algorithm to determine the likelihood that a person who owns a home
has a good credit score. You have determined that the confidence for the rules used in the
algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are
homeowners". What can you determine from the lift calculation?

1. Support for the association is low
2. Leverage of the rules is low
4. The rule is true

Question : What is an appropriate data visualization to use in a presentation for an analyst audience?

1. Pie chart
2. ROC curve
4. Stacked bar chart

Question : Consider a database with transactions:
Transaction 1: {cheese, bread, milk}
Transaction 2: {soda, bread, milk}
Transaction 3: {cheese, bread}
Transaction 4: {cheese, soda, juice}
The minimum support is 25%. Which rule has a confidence equal to 50%?

1. {bread} => {milk}
2. {bread, milk} => {cheese}
4. {bread} => {cheese}

Question : A data scientist plans to classify the sentiment polarity of product reviews collected from
the Internet. What is the most appropriate model to use? Suppose labeled training data is
available.

1. Linear regression
2. Logistic regression
4. Naive Bayesian classifier

Question : When would you use the GROUP BY ROLLUP clause in your OLAP query?

1. where only the subtotals are to be included in the output
2. where only the grand totals are to be included in the output
4. where all subtotals and grand totals are to be included in the output

Question : Which type of numeric value does a logistic regression model estimate?

1. A p-value
2. Any integer
4. Any real number