
Dell EMC Data Science and Big Data Certification Questions and Answers



Question : In the MapReduce framework, what is the purpose of the Map Function?
1. It processes the input and generates key-value pairs
2. It collects the output of the Reduce function
3. (option not shown)
4. It breaks the input into smaller components and distributes to other nodes in the cluster


Correct Answer : 1
Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored
either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be
transmitted.

"Map" step: Each worker node applies the "map()" function to the local data and writes the output to temporary storage. A master node ensures that only one copy of the redundant input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is
limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share
the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential,
MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers
some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
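
To make the map, shuffle, and reduce steps concrete, here is a minimal word-count sketch in plain Python. It only simulates the three steps in memory on a single machine; the function names (map_fn, shuffle, reduce_fn) and the sample input are illustrative assumptions, not the actual Hadoop MapReduce API.

from collections import defaultdict

def map_fn(line):
    # "Map" step: emit (key, value) pairs from one record of input.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # "Shuffle" step: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(key, values):
    # "Reduce" step: combine all values for one key into a final result.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = (pair for line in lines for pair in map_fn(line))
counts = [reduce_fn(key, values) for key, values in shuffle(mapped)]
print(sorted(counts))  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]

In a real cluster each of these steps runs in parallel across many worker nodes, and the shuffle happens over the network rather than in a single dictionary.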





Question : While having a discussion with your colleague, this person mentions that they want to perform k-means
clustering on text file data stored in HDFS.
Which tool would you recommend to this colleague?

1. Sqoop
2. Scribe
3. (option not shown)
4. Mahout


Correct Answer : 4
Explanation: Apache Mahout is a suite of machine learning libraries designed to be scalable and robust. k-Means is a simple but well-known algorithm for grouping objects (clustering). All objects need to be
represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) she wishes to identify.
Each object can be thought of as being represented by some feature vector in an n-dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k
points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are assigned to the center they are closest to. Usually the distance measure is chosen by the user and
determined by the learning task.
After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges.
The algorithm can be proven to converge after a finite number of iterations.
Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains
the same. What algorithms are implemented in Mahout?
We are interested in a wide variety of machine learning algorithms, many of which are already implemented in Mahout; the project's website maintains the current list.
What algorithms are missing from Mahout?
There are many machine learning algorithms that we would like to have in Mahout. If you have an algorithm or an improvement to an algorithm that you would like to implement, start a discussion on our mailing list.
Do I need Hadoop to use Mahout?
There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the algorithms list. In the future, we might provide more algorithm implementations on platforms more suitable for
machine learning such as Apache Spark.
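
As a rough illustration of the k-means procedure described above (random initial centers, assignment of each object to the nearest center, recomputing centers until convergence), here is a minimal NumPy sketch. It is not Mahout's implementation; the toy data, the iteration cap, and the convergence test are assumptions made only for this example.

import numpy as np

def kmeans(points, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly choose k of the objects as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # Assign each object to the closest center (Euclidean distance here).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each center as the average of the objects assigned to it.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

# Toy usage: two obvious groups of 2-dimensional feature vectors.
data = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
centers, labels = kmeans(data, k=2)
print(labels)  # e.g. [0 0 1 1]

On real text data in HDFS, the documents would first be converted to numerical feature vectors (for example TF-IDF vectors), which is the representation Mahout's clustering jobs operate on.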






Question : What describes a true limitation of the logistic regression method?

1. It does not handle redundant variables well.
2. It does not handle missing values well.
3. (option not shown)
4. It does not have explanatory values.




Correct Answer :
Explanation: Logistic regression extends the ideas of linear regression to the situation where the dependent variable, Y, is categorical. We can think of a categorical variable as dividing the observations
into classes. For example, if Y denotes a recommendation on holding/selling/buying a stock, we have a categorical variable with three categories. We can think of each of the stocks in the dataset (the observations) as
belonging to one of three classes: the hold class, the sell class, and the buy class. Logistic regression can be used for classifying a new observation, where the class is unknown, into one of the classes, based on
the values of its predictor variables (called classification). It can also be used in data (where the class is known) to find similarities between observations within each class in terms of the predictor variables
(called profiling). For example, a logistic regression model can be built to determine if a person will or will not purchase a new automobile in the next 12 months. The training set could include input variables for a
person's age, income, and gender as well as the age of an existing automobile. The training set would also include the outcome variable on whether the person purchased a new automobile over a 12-month period. The
logistic regression model provides the likelihood or probability of a person making a purchase in the next 12 months. Logistic regression attempts to predict outcomes based on a set of independent variables, but if researchers include the wrong independent variables, the
model will have little to no predictive value. For example, if college admissions decisions depend more on letters of recommendation than test scores, and researchers don't include a measure for letters of
recommendation in their data set, then the logit model will not provide useful or accurate predictions. This means that logistic regression is not a useful tool unless researchers have already identified all the
relevant independent variables.
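
As a small illustration of the automobile-purchase example, here is a scikit-learn sketch; the handful of training rows and the feature encoding are invented purely for demonstration and are not taken from any real dataset.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: age (years), income (thousands), gender (0/1), age of existing automobile (years)
X = np.array([
    [25,  40, 0, 2],
    [47,  85, 1, 9],
    [35,  60, 0, 3],
    [52, 120, 1, 10],
    [29,  45, 1, 8],
    [60,  95, 0, 11],
])
# Outcome variable: purchased a new automobile within 12 months (1 = yes, 0 = no)
y = np.array([0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# predict_proba returns class probabilities; column 1 is the probability of a purchase.
probability = model.predict_proba([[40, 70, 1, 6]])[0, 1]
print(f"Estimated purchase probability: {probability:.2f}")

The fitted coefficients are also what give logistic regression its explanatory value: each one indicates how a unit change in that predictor shifts the log-odds of a purchase.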






Related Questions


Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how
best to access their data. This colleague has a strong background in data flow languages and
programming.
Which query interface would you recommend?

1. Hive
2. Pig
3. (option not shown)
4. HBase



Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this
data with structured user data residing in a production single-instance JDBC database. They
collaborate with the production team to import the data into Hadoop. Which tool should they use?
1. Chukwa
2. Sqoop
3. (option not shown)
4. Flume




Question : What does the R code
z <- f[1:10, ]
do?

1. Assigns the 1st 10 columns of the 1st row of f to z
2. Assigns a sequence of values from 1 to 10 to z
3. (option not shown)
4. Assigns the first 10 rows of f to the vector z




Question : In R, functions like plot() and hist() are known as what?
1. generic functions
2. virtual methods
3. (option not shown)
4. generic methods





Question : Review the following code:
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1, 2, 3;
Which combination of subtotals do you expect to be returned by the query?

1. (pn, vn)
2. ( (pn, vn), (pn) )
3. (option not shown)
4. ( (pn, vn) , (pn), (vn) , ( ) )



Question : In MADlib, what does MAD stand for?

1. Machine Learning, Algorithms for Databases
2. Mathematical Algorithms for Databases
3. (option not shown)
4. Modular, Accurate, Dependable