Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)

Question : A disk drive manufacturer has a defect rate of less than 1.5% with 98% confidence. A quality
assurance team samples 1000 disk drives and finds 14 defective units. Which action should the
team recommend?

1. A larger sample size should be taken to determine if the plant is operating correctly
2. The manufacturing process is functioning properly and no further action is required
3. A smaller sample size should be taken to determine if the plant is operating correctly
4. There is a flaw in the quality assurance process and the sample should be repeated

Correct Answer : 2
Explanation: The observed defect rate is 14 out of 1000 (1.4%), which is below the 1.5% threshold (equal to 15 out of 1000), and the confidence level is also high (98%). So there is no need to take action.
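The arithmetic behind this answer can be checked with a short sketch. It assumes the 1.5% threshold quoted in the explanation, and it adds an exact binomial tail probability as an extra sanity check that is not part of the original explanation. The names binom_tail, claimed_rate and sample_rate are illustrative only.

```python
from math import comb

n = 1000              # drives sampled
defects = 14          # defective units found
claimed_rate = 0.015  # 1.5% threshold, taken from the explanation

# Observed defect rate in the sample
sample_rate = defects / n


def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# Chance of seeing 14 or more defects if the true rate were exactly 1.5%
tail = binom_tail(n, defects, claimed_rate)

print(f"sample rate: {sample_rate:.3%}")
print(f"P(X >= {defects} | p = {claimed_rate}): {tail:.3f}")
print("within claim" if sample_rate < claimed_rate else "exceeds claim")
```

The sample rate falls below the claimed limit, and 14 or more defects is an unremarkable outcome under a true rate of 1.5%, so the sample gives no reason to doubt the process.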

Question : What is a core deliverable at the end of the analytic project?

1. An implemented database design
2. A whitepaper describing the project and the implementation
3. A presentation for project sponsors
4. The training materials

Correct Answer : 3
Explanation: Four main deliverables:
Presentation for Project Sponsors, which contains high-level takeaways for executive-level stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.

Presentation for Analysts, which describes changes to business processes and reports. Data scientists reading this presentation are comfortable with technical graphs (such as Receiver Operating Characteristic [ROC] curves, density plots, and histograms) and will be interested in the details.

Code for technical people, such as engineers and others managing the production environment

Technical specifications for implementing the code

Question : You have been assigned to run a logistic regression model for each of countries, and all the
data is currently stored in a PostgreSQL database. Which tool/library would you use to produce
these models with the least effort?

1. RStudio
2. MADlib
3. RStudio
4. HBase

Correct Answer : 2
Explanation: MADlib is an open-source library for scalable in-database analytics. It offers data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. Because MADlib is designed and built to accommodate massively parallel processing of data, it is ideal for Big Data in-database analytics. MADlib supports the open-source database PostgreSQL as well as the Pivotal Greenplum Database and Pivotal HAWQ. HAWQ is a SQL query engine for data stored in the Hadoop Distributed File System (HDFS).
Module: Generalized Linear Models
Description: Includes linear regression, logistic regression, and multinomial logistic regression
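As a rough illustration of why MADlib needs little effort here, the sketch below builds the SQL for a single madlib.logregr_train call that uses its grouping_cols argument to fit one model per country. The table and column names (customer_data, churn_models, churned, age, income, country) are hypothetical, and the helper build_logregr_sql is made up for this example. The code only produces the SQL text; running it requires a PostgreSQL database with MADlib installed.

```python
def build_logregr_sql(source_table, out_table, dependent, independent, grouping_col):
    """Build a MADlib logistic regression call that trains one model per group.

    All table and column names passed in are assumed to exist in the database.
    """
    return (
        "SELECT madlib.logregr_train(\n"
        f"    '{source_table}',   -- source table\n"
        f"    '{out_table}',      -- output table, one row of coefficients per group\n"
        f"    '{dependent}',      -- dependent variable (boolean)\n"
        f"    '{independent}',    -- independent variables, 1 is the intercept\n"
        f"    '{grouping_col}'    -- grouping column: one model per distinct value\n"
        ");"
    )


# Hypothetical names, for illustration only
sql = build_logregr_sql(
    source_table="customer_data",
    out_table="churn_models",
    dependent="churned",
    independent="ARRAY[1, age, income]",
    grouping_col="country",
)
print(sql)
```

Because the grouping happens inside the database, the data never has to be exported and no per-country loop is needed on the client side.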

Related Questions

Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how
best to access their data. This colleague has previously worked extensively with SQL and
databases.
Which query interface would you recommend?

1. Flume
2. Pig
3. Hive
4. HBase

Question : In linear regression, what indicates that an estimated coefficient is significantly different from zero?

1. R-squared near 1
2. R-squared near 0
3. The estimated coefficient is greater than 3
4. A small p-value

Question : Which graphical representation shows the distribution and multiple summary statistics of a
continuous variable for each value of a corresponding discrete variable?

1. box and whisker plot
2. dotplot
3. scatterplot
4. binplot

Question : Assume that you have a data frame in R. Which function would you use to display descriptive
statistics about this variable?

1. levels
2. attributes
3. str
4. summary

Question : What is the mandatory clause that must be included when using window functions?

1. OVER
2. RANK
3. PARTITION BY
4. RANK BY
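For readers who want to see the window function syntax in action, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The sales table and its columns are invented for the example. Every window function call is followed by an OVER clause, while PARTITION BY and ORDER BY inside it are optional.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 200)],
)

# Window function with PARTITION BY and ORDER BY inside OVER
ranked = conn.execute(
    "SELECT region, amount, "
    "RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk "
    "FROM sales ORDER BY region, rnk"
).fetchall()

# Window function with an empty OVER (): the window is the whole result set
totals = conn.execute(
    "SELECT amount, SUM(amount) OVER () AS total FROM sales ORDER BY amount"
).fetchall()

print(ranked)
print(totals)
conn.close()
```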

Question : What is the purpose of the process step "parsing" in text analysis?

1. computes the TF-IDF values for all keywords and indices
2. executes the clustering and classification to organize the contents
3. performs the search and/or retrieval in finding a specific topic or an entity in a document
4. imposes a structure on the unstructured/semi-structured text for downstream analysis