
Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : You are given , , user profile pages of an online dating site in XML files, and they are
stored in HDFS. You are assigned to divide the users into groups based on the content of their
profiles. You have been instructed to try K-means clustering on this data. How should you
proceed?


1. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively.
2. Run MapReduce to transform the data, and find relevant key-value pairs.
3. Run a Naive Bayes classification as a pre-processing step in HDFS.
4. Partition the data by XML file size, and run K-means clustering in each partition.


Correct Answer : 2

Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.

"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
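
As a rough illustration of the map, shuffle, and reduce steps described above, the following base-R sketch counts words across a handful of documents. It only mimics the data flow on a single machine; a real job would run on Hadoop (for example through RHadoop), and the sample documents and variable names here are made up.

# "Map" step: emit a (key, value) pair -- here (word, 1) -- for every word in every document.
docs <- c("big data on hdfs", "k means on big data")
map_output <- do.call(rbind, lapply(docs, function(doc) {
  data.frame(key = unlist(strsplit(doc, " ")), value = 1, stringsAsFactors = FALSE)
}))

# "Shuffle" step: group all values that share the same key onto the same "node".
shuffled <- split(map_output$value, map_output$key)

# "Reduce" step: process each key group independently -- here, sum the counts per word.
reduced <- sapply(shuffled, sum)
reduced   # named vector of word counts, e.g. big = 2, data = 2, hdfs = 1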





Question : The Marketing department of your company wishes to track opinion on a new product that was
recently introduced. Marketing would like to know how many positive and negative reviews are
appearing over a given period and potentially retrieve each review for more in-depth insight.
They have identified several popular product review blogs that historically have published
thousands of user reviews of your company's products.

You have been asked to provide the desired analysis. You examine the RSS feeds for each blog
and determine which fields are relevant. You then craft a regular expression to match your new
product's name and extract the relevant text from each matching review.
What is the next step you should take?


1. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
2. Convert the extracted text into a suitable document representation and index into a review corpus
3. Read the extracted text for each review and manually tabulate the results
4. Group the reviews using Naive Bayesian classification


Correct Answer :

Explanation:
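The regular-expression step described in the question can be sketched in R as follows. The product name and sample feed entries below are made up for illustration; they are not from an actual RSS feed.

# Hypothetical review texts pulled from the relevant RSS-feed field (made-up data).
reviews <- c("The AcmeWidget 2000 is fantastic, best purchase this year.",
             "Disappointed with the AcmeWidget 2000 battery life.",
             "An unrelated post about something else entirely.")

# Regular expression matching the (hypothetical) new product's name.
pattern <- "AcmeWidget 2000"

# Keep only the reviews that mention the product; the extracted text is what the
# next processing step would operate on.
matching <- reviews[grepl(pattern, reviews)]
matching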




Question : Which word or phrase completes the statement? A Data Scientist would consider that an RDBMS is
to a Table as R is to a ______________ .

1. List
2. Matrix
3. Data frame
4. Array


Correct Answer : 3

Explanation: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
Built-in Data Frames
R also includes a number of built-in data frames. For example, here is a built-in data frame in R, called mtcars.

> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
............
The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.

To retrieve the data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma. In other words, the coordinates begin with the row position, followed by a comma, and end with the column position.
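
For example, using the mtcars data frame shown above:

> mtcars["Mazda RX4", "mpg"]     # row "Mazda RX4", column "mpg" -> 21.0
> mtcars[1, 2]                   # the same by position: first row, second column (cyl) -> 6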



Related Questions


Question : What is the format of the output from the Map function of MapReduce?

1. Key-value pairs
2. Binary representation of keys concatenated with structured data
3. Access Mostly Uused Products by 50000+ Subscribers
4. Unique key record and separate records of all possible values




Question : Which data type value is used for the observed response variable in a logistic regression model?

1. Any integer
2. Any positive real number
3. Access Mostly Uused Products by 50000+ Subscribers
4. A binary value
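
As a hedged aside on the concept this question tests: in R, a logistic regression is fit with glm() and a binomial family, and the observed response supplied to the model is typically binary (0/1 or a two-level factor). The data below is simulated purely for illustration.

# Minimal sketch: logistic regression with a binary observed response (simulated data).
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))   # y takes only the values 0 and 1

fit <- glm(y ~ x, family = binomial)   # logistic regression
summary(fit)$coefficients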




Question : A data scientist is given an R data frame, "empdata", with the columns Age, Salary, Occupation,
Education, and Gender. The data scientist would like to examine only the Salary and Occupation
columns for ages greater than 40. Which command extracts the appropriate rows and columns
from the data frame?

1. empdata[c("Salary", "Occupation"), empdata$Age > 40]
2. empdata[Age > 40, ("Salary", "Occupation")]
3. Access Mostly Uused Products by 50000+ Subscribers
4. empdata[, c("Salary", "Occupation")]$Age > 40
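
As a sketch of the kind of subsetting this question is about: in R's single-bracket operator the row condition goes before the comma and the column selection after it. Since the real empdata is not shown, a toy version is constructed here with made-up values.

# Toy stand-in for the empdata frame described in the question (values are made up).
empdata <- data.frame(Age        = c(35, 45, 52),
                      Salary     = c(50000, 72000, 68000),
                      Occupation = c("Analyst", "Engineer", "Manager"),
                      Education  = c("BS", "MS", "PhD"),
                      Gender     = c("F", "M", "F"),
                      stringsAsFactors = FALSE)

# Rows with Age greater than 40, restricted to the Salary and Occupation columns.
empdata[empdata$Age > 40, c("Salary", "Occupation")]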




Question : What is required in a presentation for business analysts?

1. Operational process changes
2. Budgetary considerations and requests
3. Access Mostly Uused Products by 50000+ Subscribers
4. The presentation author's credentials




Question : What is LOESS used for?
1. It plots a continuous variable versus a discrete variable, to compare distributions across classes.
2. It is a significance test for the correlation between two variables.
3. Access Mostly Uused Products by 50000+ Subscribers
4. It is run after a one-way ANOVA, to determine which population has the highest mean value.
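
For background, LOESS (locally weighted regression) is available in R through loess(), which fits a smooth curve through a scatterplot of two continuous variables. A minimal sketch on simulated data:

# LOESS sketch: fit and plot a smooth curve through noisy (x, y) data (simulated).
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

fit <- loess(y ~ x, span = 0.3)   # span controls how much smoothing is applied

plot(x, y, pch = 16, col = "grey")
lines(x, predict(fit), lwd = 2)   # smooth locally weighted fit through the scatter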




Question : Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to
____________ .

1. R
2. PostgreSQL
3. Access Mostly Uused Products by 50000+ Subscribers
4. SAS