Correct Answer : Exp: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing data on or near the storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. "Reduce" step: Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
1. Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
These five steps can logically be thought of as running in sequence (each step starts only after the previous step is completed), although in practice they can be interleaved as long as the final result is not affected.
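As an illustration of the five steps, here is a minimal sketch in R that averages salaries per occupation; the records, field names, and choice of K1 and K2 are hypothetical (K1 is simply each record's position in the input, K2 is the occupation emitted by Map()).

# Step 1: prepare the Map() input; each record is keyed by K1 (its index).
records <- list(
  list(occupation = "Engineer", salary = 65000),
  list(occupation = "Analyst",  salary = 50000),
  list(occupation = "Engineer", salary = 80000)
)

# Step 2: run Map() once per K1 value, emitting output keyed by K2.
map_out <- lapply(records, function(r) list(key = r$occupation, value = r$salary))

# Step 3: shuffle, i.e. group the Map output by K2.
by_k2 <- split(sapply(map_out, function(p) p$value),
               sapply(map_out, function(p) p$key))

# Step 4: run Reduce() once per K2 value, here averaging the salaries.
reduce_out <- lapply(by_k2, mean)

# Step 5: collect the final output, sorted by K2.
reduce_out[order(names(reduce_out))]   # Analyst = 50000, Engineer = 72500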
In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.
Question : Which data type value is used for the observed response variable in a logistic regression model?
Correct Answer : Exp: The observed response variable in a logistic regression model is binary (categorical), taking one of two values such as 0/1, yes/no, or success/failure. Logistic regression is used widely in many fields, including the medical and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient such as age, sex, body mass index, and the results of various blood tests, or risk factors such as blood cholesterol level, systolic blood pressure, relative weight, blood hemoglobin level, smoking (at three levels), and an abnormal electrocardiogram. Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system, or product. It is also used in marketing applications such as predicting a customer's propensity to purchase a product or halt a subscription. In economics it can be used to predict the likelihood of a person's choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
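The point about the response variable can be seen directly in R: glm() with family = binomial fits a logistic regression and expects a binary outcome. Here is a minimal sketch using simulated, hypothetical patient data.

set.seed(1)
age <- rnorm(100, mean = 50, sd = 10)
# The observed response is binary: disease present (1) or absent (0).
disease <- rbinom(100, size = 1, prob = plogis(-10 + 0.2 * age))

# family = binomial tells glm() to fit a logistic regression.
model <- glm(disease ~ age, family = binomial)
summary(model)

# The fitted model outputs a probability, but the observed responses
# themselves are always one of the two categories, 0 or 1.
predict(model, newdata = data.frame(age = 60), type = "response")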
Question : A data scientist is given an R data frame, "empdata", with the columns Age, Salary, Occupation, Education, and Gender. The data scientist would like to examine only the Salary and Occupation columns for ages greater than 40. Which command extracts the appropriate rows and columns from the data frame?
Correct Answer : Exp: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)   # df is a data frame

Built-in Data Frames. We use built-in data frames in R for our tutorials. For example, here is a built-in data frame in R, called mtcars.

> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
............

The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell. To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma: the row position first, then the column position. The order is important. Here is the cell value from the first row, second column of mtcars.

> mtcars[1, 2]
[1] 6

Moreover, we can use the row and column names instead of the numeric coordinates.

> mtcars["Mazda RX4", "cyl"]
[1] 6

Lastly, the number of data rows in the data frame is given by the nrow function.

> nrow(mtcars)   # number of data rows
[1] 32

And the number of columns of a data frame is given by the ncol function.

> ncol(mtcars)   # number of columns
[1] 11

Further details of the mtcars data set are available in the R documentation.

> help(mtcars)

Preview. Instead of printing out the entire data frame, it is often desirable to preview it with the head function beforehand.

> head(mtcars)
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
............
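Applying the same single-bracket indexing to the question: the sketch below builds a small, hypothetical empdata with the columns named in the question, and its last line is the command the question asks for, using a logical condition to select the rows and a character vector to select the columns.

# Hypothetical sample data with the columns described in the question.
empdata <- data.frame(
  Age        = c(35, 42, 51, 28),
  Salary     = c(50000, 65000, 80000, 45000),
  Occupation = c("Analyst", "Engineer", "Manager", "Clerk"),
  Education  = c("BS", "MS", "PhD", "BS"),
  Gender     = c("F", "M", "F", "M")
)

# Rows where Age is greater than 40, keeping only Salary and Occupation.
empdata[empdata$Age > 40, c("Salary", "Occupation")]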