Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : Assume that you have a data frame in R. Which function would you use to display descriptive
statistics about this variable?

1. levels
2. attributes
3. str
4. summary



Correct Answer : 4
Explanation: summary is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

Usage:

summary(object, ...)

## Default S3 method:
summary(object, ..., digits = max(3, getOption("digits") - 3))

## S3 method for class 'data.frame':
summary(object, maxsum = 7,
        digits = max(3, getOption("digits") - 3), ...)

## S3 method for class 'factor':
summary(object, maxsum = 100, ...)

## S3 method for class 'matrix':
summary(object, ...)
Arguments:

object : an object for which a summary is desired.
maxsum : integer, indicating how many levels should be shown for factors.
digits : integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
... : additional arguments affecting the summary produced.

Details: For factors, the frequency of the first maxsum - 1 most frequent levels is shown, and the less frequent levels are summarized in "(Others)" (resulting in at most maxsum frequencies). The functions summary.lm and summary.glm are examples of particular methods which summarize the results produced by lm and glm.
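As a quick illustration, here is a minimal sketch of summary() applied to a small data frame; the column names and values are made up for the example:

# A tiny hypothetical data frame
df <- data.frame(
  age    = c(23, 35, 31, 47, 52),
  gender = factor(c("F", "M", "F", "M", "F"))
)
summary(df)
# Numeric columns report Min., 1st Qu., Median, Mean, 3rd Qu., Max.;
# factor columns report counts per level.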




Question : What is the mandatory Clause that must be included when using Window functions?
1. OVER
2. RANK
3. PARTITION BY
4. RANK BY



Correct Answer : 1
Explanation: A window function call always contains an OVER clause following the window function's name and argument(s). This is what syntactically distinguishes it from a regular function or aggregate function. The OVER clause determines exactly how the rows of the query are split up for processing by the window function. The PARTITION BY list within OVER specifies dividing the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.

Although avg will produce the same result no matter what order it processes the partition's rows in, this is not true of all window functions. When needed, you can control that order using ORDER BY within OVER. Here is an example:

-- Rank employees within each department by salary, highest first
SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
FROM empsalary;








Question : What is the purpose of the process step "parsing" in text analysis?
1. computes the TF-IDF values for all keywords and indices
2. executes the clustering and classification to organize the contents
3. performs the search and/or retrieval in finding a specific topic or an entity in a document
4. imposes a structure on the unstructured/semi-structured text for downstream analysis


Correct Answer : 4
Explanation: Parsing is the process that takes unstructured text and imposes a structure for further
analysis. The unstructured text could be a plain text file, a weblog, an Extensible Markup
Language (XML) file, a HyperText Markup Language (HTML) file, or a Word document.
Parsing deconstructs the provided text and renders it in a more structured way for the
subsequent steps.
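To make the idea concrete, here is a minimal base-R sketch that imposes a simple structure (a word-frequency table) on a raw string; the input text is invented for the example:

# A made-up snippet standing in for unstructured input
raw <- "Parsing imposes structure on text. Parsing enables analysis."
# Normalize case, strip punctuation, and split into tokens
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", raw)), "\\s+")[[1]]
# The structured result: a term-frequency table for downstream steps
table(tokens)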



Related Questions


Question : You are given user profile pages of an online dating site in XML files, and they are
stored in HDFS. You are assigned to divide the users into groups based on the content of their
profiles. You have been instructed to try K-means clustering on this data. How should you
proceed?


1. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively.
2. Run MapReduce to transform the data, and find relevant key value pairs.
3. Run a Naive Bayes classification as a pre-processing step in HDFS.
4. Partition the data by XML file size, and run K-means clustering in each partition.
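Whatever the preprocessing route, the clustering step itself is straightforward once the XML profiles have been reduced to numeric features; here is a minimal base-R sketch on a hypothetical feature matrix (the data is randomly generated, purely for illustration):

set.seed(42)                          # reproducible toy example
# Hypothetical numeric features derived from the user profiles
features <- matrix(rnorm(200), ncol = 2)
# Divide the users into 4 groups
fit <- kmeans(features, centers = 4)
table(fit$cluster)                    # size of each group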



Question : The Marketing department of your company wishes to track opinion on a new product that was
recently introduced. Marketing would like to know how many positive and negative reviews are
appearing over a given period and potentially retrieve each review for more in-depth insight.
They have identified several popular product review blogs that historically have published
thousands of user reviews of your company's products.

You have been asked to provide the desired analysis. You examine the RSS feeds for each blog
and determine which fields are relevant. You then craft a regular expression to match your new
product's name and extract the relevant text from each matching review.
What is the next step you should take?


1. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
2. Convert the extracted text into a suitable document representation and index into a review corpus
3. Read the extracted text for each review and manually tabulate the results
4. Group the reviews using Naive Bayesian classification
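For context on option 2, here is a minimal base-R sketch of one possible document representation, a bag-of-words document-term matrix; the review strings are invented for the example:

# Invented review snippets standing in for the extracted text
reviews <- c("great product love it",
             "product stopped working",
             "love this great buy")
# Tokenize each review and count terms against a shared vocabulary
tokens <- strsplit(reviews, "\\s+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm    # rows = reviews, columns = term counts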



Question : Which word or phrase completes the statement? A Data Scientist would consider that an RDBMS is
to a Table as R is to a ______________ .

1. List
2. Matrix
3. Data frame
4. Array



Question : Which word or phrase completes the statement? Unix is to bash as Hadoop is to:


1. NameNode
2. Sqoop
3. HDFS
4. Flume
5. Pig



Question : A call center for a large electronics company handles a high volume of support calls each day.
The head of the call center would like to optimize the staffing of the call center during the rollout of
a new product due to recent customer complaints of long wait times. You have been asked to
create a model to optimize call center costs and customer wait times.
The goals for this project include:
1. Relative to the release of a product, how does the call volume change over time?
2. How to best optimize staffing based on the call volume for the newly released product, relative
to old products.
3. Historically, what time of day does the call center need to be most heavily staffed?
4. Determine the frequency of calls by both product type and customer language.
Which goals are suitable to be completed with MapReduce?


1. Goals 2 and 4
2. Goals 1 and 3
3. Goals 1, 2, 3, 4
4. Goals 2, 3, 4
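As a loose illustration of the counting pattern MapReduce is built for (goal 4 is a frequency count by key), here is a base-R analogue on invented call records:

# Invented call records; at scale these would live in HDFS
calls <- data.frame(
  product  = c("phone", "tv", "phone", "laptop", "tv", "phone"),
  language = c("en", "en", "es", "en", "es", "en")
)
# Conceptually, the map step emits (product, language) keys and the
# reduce step sums the counts; table() is the single-machine analogue
table(calls$product, calls$language)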



Question : Consider the example of an analysis for fraud detection on credit card usage. You will need to
ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your
data for analysis, and not dropped as outliers during pre-processing. What will be your approach
for loading data into the analytical sandbox for this analysis?


1. ETL
2. ELT
3. EDW
4. OLTP