Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop?
Explanation: Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
This document describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:
System and application programmers
System administrators
Database administrators
Data analysts
Data engineers
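For the export direction described above (moving processed results from HDFS into a relational table), the following is a minimal sketch that invokes the sqoop export command from Python via subprocess. The JDBC URL, table name, HDFS directory, credentials file, and mapper count are hypothetical placeholders, not values taken from this scenario.

```python
# Minimal sketch: calling "sqoop export" to push processed results from HDFS
# into a relational table. All connection details below are hypothetical.
import subprocess

sqoop_export = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://db.example.com/analytics",  # target RDBMS (hypothetical)
    "--table", "user_summary",                             # destination table
    "--export-dir", "/user/hive/warehouse/user_summary",   # HDFS directory holding the results
    "--username", "analyst",
    "--password-file", "/user/analyst/.db_password",       # keeps the password off the command line
    "--num-mappers", "4",                                   # number of parallel export map tasks
]

subprocess.run(sqoop_export, check=True)  # raises CalledProcessError if Sqoop exits non-zero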
Question : When would you prefer a Naive Bayes model to a logistic regression model for classification?
1. When some of the input variables might be correlated
2. When all the input variables are numerical.
3. Access Mostly Uused Products by 50000+ Subscribers
4. When you are using several categorical input variables with over 1000 possible values each.
Explanation: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.[6]
An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification. This is why it is often preferred over logistic regression when there are several categorical input variables, each with a very large number of possible values: naive Bayes only needs per-level class-conditional frequencies, whereas logistic regression would need a dummy variable, and correspondingly more data, for every level.
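To make the independence assumption concrete, here is a minimal sketch using scikit-learn's CategoricalNB on a tiny, made-up fruit dataset; the integer encoding and the data themselves are purely illustrative.

```python
# Minimal sketch: a naive Bayes classifier over categorical features.
# The data and encodings are made up for illustration.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Columns: color (0=red, 1=yellow), shape (0=round, 1=long), size (0=small, 1=medium, 2=large)
X = np.array([
    [0, 0, 1],   # red, round, medium   -> apple
    [0, 0, 2],   # red, round, large    -> apple
    [1, 1, 0],   # yellow, long, small  -> banana
    [1, 1, 1],   # yellow, long, medium -> banana
])
y = np.array(["apple", "apple", "banana", "banana"])

clf = CategoricalNB()          # each feature contributes independently, given the class
clf.fit(X, y)

print(clf.predict([[0, 0, 2]]))        # expected: ['apple']
print(clf.predict_proba([[0, 0, 2]]))  # per-class probabilities
```

Each class-conditional probability here is estimated from simple smoothed counts, which is why relatively little training data is needed even when a categorical feature has many possible values.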
Question : Before you build an ARMA model, how can you tell if your time series is weakly stationary?
Explanation: A time series is weakly stationary when its mean and variance are constant over time and the autocovariance between X_t and X_{t+h} depends only on the lag h, not on t. In practice you can check this by plotting the series (and its autocorrelation function) to confirm there is no trend or changing variance, or by applying a unit-root test such as the augmented Dickey-Fuller test. In the statistical analysis of time series, autoregressive-moving-average (ARMA) models provide a parsimonious description of a (weakly) stationary stochastic process in terms of two polynomials, one for the autoregression and the second for the moving average. Given a time series of data X_t, the ARMA model is a tool for understanding and, perhaps, predicting future values in this series. The model consists of two parts, an autoregressive (AR) part and a moving average (MA) part, and is usually referred to as the ARMA(p,q) model, where p is the order of the autoregressive part and q is the order of the moving average part. There are a number of modelling options to account for a non-constant variance, for example ARCH (and GARCH, and their many extensions) or stochastic volatility models.
An ARCH model extends an ARMA model with an additional time series equation for the squared error term. These models tend to be fairly easy to estimate (the fGarch R package, for example).
Stochastic volatility (SV) models extend ARMA models with an additional time series equation (usually an AR(1)) for the log of the time-dependent variance. I have found these models are best estimated using Bayesian methods (OpenBUGS has worked well for me in the past).
You can also fit an ARIMA model, but first you need to stabilize the variance by applying a suitable transformation, such as a Box-Cox transformation. This approach is taken in Time Series Analysis: With Applications in R (page 99), where a Box-Cox transformation is applied before Box-Jenkins modelling. Another reference is page 169 of Introduction to Time Series and Forecasting by Brockwell and Davis: "Once the data have been transformed (e.g., by some combination of Box-Cox and differencing transformations or by removal of trend and seasonal components) to the point where the transformed series X_t can potentially be fitted by a zero-mean ARMA model, we are faced with the problem of selecting appropriate values for the orders p and q." Therefore, you need to stabilize the variance prior to fitting the ARIMA model.
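As a worked illustration of the stationarity check described above, here is a minimal sketch (assuming statsmodels is available) that runs an augmented Dickey-Fuller test on a simulated AR(1) series and, if the unit-root null is rejected, fits an ARMA model; the simulated data and the chosen orders are illustrative only.

```python
# Minimal sketch: check weak stationarity with an augmented Dickey-Fuller test,
# then fit an ARMA(p, q) model. The simulated series and orders are illustrative.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
phi = 0.5
x = np.zeros(500)
for t in range(1, 500):            # AR(1) process, weakly stationary for |phi| < 1
    x[t] = phi * x[t - 1] + rng.normal()

stat, pvalue, *_ = adfuller(x)     # null hypothesis: the series has a unit root
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")

if pvalue < 0.05:                  # unit root rejected: treat the series as stationary
    # ARMA(1, 1) is ARIMA(1, 0, 1): no differencing once the series is stationary
    result = ARIMA(x, order=(1, 0, 1)).fit()
    print(result.summary())
```

If the test fails to reject the unit root, or the variance is clearly not constant, difference or transform the series (e.g., with a Box-Cox transformation) before fitting, as the references quoted above recommend.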