
Dell EMC Data Science and Big Data Certification Questions and Answers



Question : You are given user profile pages of an online dating site in XML files, and they are
stored in HDFS. You are assigned to divide the users into groups based on the content of their
profiles. You have been instructed to try K-means clustering on this data. How should you
proceed?


1. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively.
2. Run MapReduce to transform the data, and find relevant key value pairs.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Partition the data by XML file size, and run K-means clustering in each partition.


Correct Answer :

Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network
and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a
filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.

"Map" step: Each worker node applies the "map()" function to the local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is
limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share
the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential,
MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers
some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
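The three steps above can be sketched as a minimal, single-process Python word count, in which the shuffle is simulated by grouping mapper output by key in memory (a simplification of what Hadoop does when routing data across nodes):

```python
from collections import defaultdict

def map_phase(document):
    # "Map" step: emit a (word, 1) pair for every word in the local data.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(mapped_pairs):
    # "Shuffle" step: collect all values belonging to one key together,
    # so each key can be handed to a single reducer.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # "Reduce" step: process each group of values, per key.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["the"])  # 3
```

Because summation is associative, the reduce step here could itself be split across many reducers, which is exactly the property the explanation above relies on.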





Question : The Marketing department of your company wishes to track opinion on a new product that was
recently introduced. Marketing would like to know how many positive and negative reviews are
appearing over a given period and potentially retrieve each review for more in-depth insight.
They have identified several popular product review blogs that historically have published
thousands of user reviews of your company's products.

You have been asked to provide the desired analysis. You examine the RSS feeds for each blog
and determine which fields are relevant. You then craft a regular expression to match your new
product's name and extract the relevant text from each matching review.
What is the next step you should take?


1. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
2. Convert the extracted text into a suitable document representation and index into a review corpus
3. Access Mostly Uused Products by 50000+ Subscribers
4. Group the reviews using Naive Bayesian classification


Correct Answer :

Explanation:




Question : Which word or phrase completes the statement? A Data Scientist would consider that an RDBMS is
to a Table as R is to a ______________ .

1. List
2. Matrix
3. Access Mostly Uused Products by 50000+ Subscribers
4. Array


Correct Answer :

Explanation: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
Built-in Data Frames
We use built-in data frames in R for our tutorials. For example, here is a built-in data frame in R, called mtcars.

> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.

To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma: the row position first, then the column position. For example:

> mtcars["Mazda RX4", "mpg"]
[1] 21



Related Questions


Question : Which of the following is not a classification algorithm?

1. Logistic Regression
2. Support Vector Machine
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hidden Markov Models
5. None of the above




Question : Suppose a man told you he had a nice conversation with someone on the train. Not knowing anything
about this conversation, the probability that he was speaking to a woman is 50% (assuming the train had an equal
number of men and women and the speaker was as likely to strike up a conversation with a man as with a woman).
Now suppose he also told you that his conversational partner had long hair. It is now more likely he was speaking
to a woman, since women are more likely to have long hair than men. ____________ can be used to calculate
the probability that the person was a woman.
1. SVM
2. MLE
3. Access Mostly Uused Products by 50000+ Subscribers
4. Logistic Regression
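The scenario above is the classic illustration of Bayes' theorem. The calculation can be sketched in a few lines of Python; note that the hair-length probabilities below are illustrative assumptions, since the question does not supply them:

```python
# Bayes' theorem:
#   P(woman | long hair) = P(long hair | woman) * P(woman) / P(long hair)
p_woman = 0.5               # prior: equal numbers of men and women on the train
p_long_given_woman = 0.75   # assumed: 75% of women have long hair
p_long_given_man = 0.15     # assumed: 15% of men have long hair

# Total probability of observing long hair on a random conversational partner.
p_long = (p_long_given_woman * p_woman
          + p_long_given_man * (1 - p_woman))

p_woman_given_long = p_long_given_woman * p_woman / p_long
print(round(p_woman_given_long, 3))  # 0.833
```

The posterior (about 83% under these assumed figures) is higher than the 50% prior, matching the intuition in the question that long hair makes a woman the more likely partner.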




Question : Bayes' theorem cannot find the actual probability of an event from the results of your tests?

1. True
2. False




Question : You are creating a regression model with the inputs income, education, and current debt of a customer.
What could be the possible output from this model?
1. Customer fit as a good category
2. Customer fit as acceptable or average category
3. Access Mostly Uused Products by 50000+ Subscribers
4. 1 and 3 are correct
5. 2 and 3 are correct


Question : In which of these scenarios can you use regression to predict values?
1. Samsung can use it for mobile sales forecast
2. Mobile companies can use it to forecast manufacturing defects
3. Access Mostly Uused Products by 50000+ Subscribers
4. Only 1 and 2
5. All of 1, 2 and 3



Question : You are creating a classification process where the inputs are the income, education, and
current debt of a customer. What could be the possible output of this process?
1. Probability of the customer default on loan repayment
2. Percentage of the customer loan repayment capability
3. Access Mostly Uused Products by 50000+ Subscribers
4. The output might be a risk class, such as "good", "acceptable", "average", or "unacceptable".
5. All of the above