Question : In the MapReduce framework, what is the purpose of the Map Function? 1. It processes the input and generates key-value pairs 2. It collects the output of the Reduce function 3. It sorts the results of the Reduce function 4. It breaks the input into smaller components and distributes them to other nodes in the cluster
Correct Answer : 1 Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. "Reduce" step: Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
Question : You have completed your model and are handing it off to be deployed in production. What should you deliver to the production team, along with your commented code? 1. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on expected model inputs and outputs, and guidance on error handling. 2. The production team is technical, and they need to understand how the processes that they support work, so give them the same presentation that you prepared for the analysts. 3. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the same presentation that you prepared for the project sponsor. 4. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the executive summary.
Correct Answer : 1
Explanation: The Data Analytics Lifecycle has six phases: 1. Discovery; 2. Data preparation; 3. Model planning; 4. Model building; 5. Communicate results; 6. Operationalize. In the Operationalize phase, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Question : A colleague mentions that they want to perform k-means clustering on text file data stored in HDFS. Which tool would you recommend to this colleague?
1. Sqoop 2. Scribe 3. HBase 4. Mahout
Correct Answer : 4 Explanation: Apache Mahout is a suite of machine learning libraries designed to be scalable and robust. k-means is a simple but well-known algorithm for clustering, i.e., grouping objects. All objects must be represented as a set of numerical features, and the user has to specify the number of groups (referred to as k) to identify.
Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. All objects are then assigned to the center they are closest to; the distance measure is usually chosen by the user and determined by the learning task. Next, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. This process of assigning objects and recomputing centers is repeated until it converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the choice of initial centers, and the computation of new centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same.
What algorithms are implemented in Mahout? A wide variety of machine learning algorithms, many of which are already implemented; a list is maintained in the project documentation. What algorithms are missing from Mahout? Many machine learning algorithms would be welcome additions; anyone who wants to implement an algorithm or an improvement to one can start a discussion on the project mailing list. Do I need Hadoop to use Mahout? A number of algorithm implementations require no Hadoop dependencies whatsoever (consult the algorithms list), and in the future more implementations may be provided on platforms better suited for machine learning, such as Apache Spark.
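To make the loop concrete, here is a toy in-memory Python sketch of the k-means iteration described above (not Mahout's scalable implementation; the sample points, k=2, and the Euclidean distance measure are illustrative assumptions):

import math
import random

def kmeans(points, k, iterations=100):
    # Randomly choose k objects as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster
        # (keeping the old center if a cluster ends up empty).
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(kmeans(points, k=2))

Mahout's value is running this same assign-and-recompute loop at scale, with each pass over the objects distributed across a Hadoop cluster.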
1. A larger sample size should be taken to determine if the plant is operating correctly 2. The manufacturing process is functioning properly and no further action is required 3. A smaller sample size should be taken to determine if the plant is operating correctly 4. There is a flaw in the quality assurance process and the sample should be repeated
1. Communication skill 2. Scientific background 3. Domain expertise 4. Well Organized
Question : What describes the use of the UNION clause in a SQL statement? 1. Operates on queries and potentially decreases the number of rows 2. Operates on queries and potentially increases the number of rows 3. Operates on tables and potentially decreases the number of columns 4. Operates on both tables and queries and potentially increases both the number of rows and columns
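Correct Answer : 2 Explanation: UNION operates on the result sets of two queries (not directly on tables) and appends their rows, removing duplicates, so the combined result can contain more rows than either query alone; the column count must match between the two queries and does not change. As a minimal illustration using Python's built-in sqlite3 module (the table and column names here are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT);
    CREATE TABLE contractors (name TEXT);
    INSERT INTO employees VALUES ('Alice'), ('Bob');
    INSERT INTO contractors VALUES ('Bob'), ('Carol');
""")
# UNION merges the two query results and removes the duplicate 'Bob',
# yielding 3 rows: more than either 2-row query on its own.
rows = conn.execute(
    "SELECT name FROM employees UNION SELECT name FROM contractors"
).fetchall()
print(rows)  # e.g. [('Alice',), ('Bob',), ('Carol',)]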