Dell EMC Data Science and BigData Certification Questions and Answers

Question : What is required in a presentation for business analysts?

1. Operational process changes
2. Budgetary considerations and requests
4. The presentation author's credentials

Exp: Business User: Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the
project, the value of the results, and how the outputs will be operationalized. Usually a business analyst, line manager, or deep subject matter expert in the project domain
fulfills this role.

Question : What is LOESS used for?

1. It plots a continuous variable versus a discrete variable, to compare distributions across classes.
2. It is a significance test for the correlation between two variables.
4. It is run after a one-way ANOVA, to determine which population has the highest mean value.

Exp: The loess() function can be used to fit a nonlinear line to the data. LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related non-parametric regression methods that
combine multiple regression models in a k-nearest-neighbor-based meta-model. "LOESS" is a later generalization of LOWESS; although it is not a true initialism, it may be understood as standing for "LOcal regrESSion".
LOESS and LOWESS thus build on "classical" methods, such as linear and nonlinear least squares regression. They address situations in which the classical procedures do not perform well or cannot be effectively applied
without undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of nonlinear regression. It does this by fitting simple models to localized subsets of the data to
build up a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not required to specify a global
function of any form to fit a model to the data, only to fit segments of the data.

The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would have been practically impossible to use in the era when least squares regression was being developed.
Most other modern methods for process modeling are similar to LOESS in this respect. These methods have been consciously designed to use our current computational ability to the fullest possible advantage to achieve
goals not easily achieved by traditional approaches. A smooth curve through a set of data points obtained with this statistical technique is called a Loess Curve, particularly when each smoothed value is given by a
weighted quadratic least squares regression over the span of values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least squares regression over the span, this is
known as a Lowess curve; however, some authorities treat Lowess and Loess as synonyms. This is a method for fitting a smooth curve between two variables, or fitting a smooth surface between an outcome and up to four
predictor variables.

The procedure originated as LOWESS (LOcally WEighted Scatter-plot Smoother). Since then it has been extended as a modelling tool because it has some useful statistical properties (Cleveland, 1998). This is a
nonparametric method because the linearity assumptions of conventional regression methods have been relaxed. Instead of estimating parameters like m and c in y = mx +c, a nonparametric regression focuses on the fitted
curve. The fitted points and their standard errors are estimated with respect to the whole curve rather than a particular estimate. So, the overall uncertainty is measured as how well the estimated curve
fits the population curve. It is called local regression because the fitting at, say, point x is weighted toward the data nearest to x. The distance from x that is considered near to it is controlled by the span
setting, a. When a is less than 1 it represents the proportion of the data that is considered to be neighbouring x, and the weighting that is used is proportional to (1 - (distance/maximum distance)^3)^3, which is known
as tricubic. When a is greater than 1 all of the points are used and the maximum distance is taken as a^(1/p) times the observed maximum distance for p predictors. The default span is a = 0.75. If you choose a span
that is too small then there will be insufficient data near x for an accurate fit, resulting in a large variance. If the span is too large then the regression will be over-smoothed, resulting in a loss of information,
hence a large bias. The trade-off between bias and variance also depends on the degree of the polynomial selected. A high degree will provide a better approximation of the population mean, so less bias, but there are
more factors to consider in the model, resulting in greater variance. The default degree is 2 (quadratic). Higher degrees don't improve the fit much. The lower degree (i.e. 1, linear) has more bias but pulls back
variance at the boundaries. There is no substitute for thinking carefully about what you are plotting, testing different settings of span and polynomial degree, and selecting the most plausible fit by eye. The summary
statistics also give an indication of how well the model fits.
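
The span and tricubic weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the R loess() implementation: it assumes a single predictor, a span no greater than 1, and a local straight line (degree 1) rather than the default quadratic. The names tricube and loess_point are hypothetical.

```python
import math


def tricube(u):
    """Tricube weight (1 - |u|^3)^3 inside the window, 0 outside it."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(xs, ys, x0, span=0.75):
    """Estimate y at x0 from a weighted straight-line fit to nearby points."""
    n = len(xs)
    # Number of points treated as neighbours of x0 (assumes span <= 1).
    k = min(n, max(2, math.ceil(span * n)))
    dists = sorted(abs(x - x0) for x in xs)
    # Window radius: distance to the k-th nearest point, nudged outward so
    # that this point keeps a small positive weight (an illustrative choice).
    max_dist = dists[k - 1] * 1.0001 or 1e-12
    w = [tricube((x - x0) / max_dist) for x in xs]

    # Weighted least squares for y = c + m * x, written around weighted means.
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)


# Toy data lying exactly on y = 2x + 1; a local linear fit should recover it.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
smoothed = [loess_point(xs, ys, x) for x in xs]
```

A smaller span shrinks the window around each x0 (more variance, less bias), and a larger span widens it, which matches the trade-off described in the text.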

The concept of degrees of freedom for nonparametric models is complex. Nonparametric methods approximate the parametric concept of degrees of freedom empirically, which results in numbers that are not necessarily
integers. The assumptions are:
Around point x the mean of y can be approximated by a small class of parametric functions in polynomial regression.
The errors in estimating y are independent and randomly distributed with a mean of zero.
Bias and variance are traded off by the choices for the settings of span and degree of polynomial.
Technical Validation : The loess function in the {stats} package of R is called. You must have R installed on the computer from which you are running StatsDirect. The R implementation is based on the cloess
algorithm, for which the original authors have a NETLIB site.

Question : Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to
____________ .

1. R
2. PostgreSQL
4. SAS

Exp: Key philosophies driving the architecture of MADlib are:
Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
Utilize best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
Maintain an open implementation with active ties to ongoing academic research.

Classification : When the desired output is categorical in nature we use classification methods to build a model that predicts which of the various categories a new result would fall into. The goal of classification is to be able to
label incoming records with the correct class. Example: If we had data that described various demographic data and other features of individuals applying for loans and we had historical data
that included which past loans had defaulted, then we could build a model that described the likelihood that a new set of demographic data would result in a loan default. In this case the categories are "will default"
or "won't default", which are two discrete classes of output.
Regression : When the desired output is continuous in nature we use regression methods to build a model that predicts the output value.
Example: If we had data that described properties of real estate listings then we could build a model to predict the sale value for homes based on the known characteristics of the houses. This is a regression because
the output response is continuous in nature rather than categorical.
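
As a rough illustration of the regression example, the sketch below fits a straight line by least squares to a single predictor. The listing sizes and prices are invented for the example, and fit_line and predict are hypothetical helper names, not part of any library mentioned above.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new predictor value."""
    intercept, slope = model
    return intercept + slope * x


# Invented toy listings: floor area in square metres, sale price in thousands.
sizes = [50, 80, 110, 140]
prices = [150, 240, 330, 420]
model = fit_line(sizes, prices)
estimate = predict(model, 100)
```

The output is a continuous number (a price), which is what distinguishes this from the classification case.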
Clustering : Here we are trying to identify groups of data such that the items within one cluster are more similar to each other than they are to the items in any other cluster.
Example: In customer segmentation analysis the goal is to identify specific groups of customers that behave in a similar fashion so that various marketing campaigns can be designed to reach these markets. When the
customer segments are known in advance this would be a supervised classification task; when we let the data itself identify the segments this is a clustering task.
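
To make "letting the data identify the segments" concrete, here is a minimal sketch of k-means, one common clustering technique, on one-dimensional values. The spend figures and starting centers are invented, and kmeans_1d is a hypothetical helper; real use would involve several features and a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members; keep it if it has none.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Invented monthly spend values for six customers, with two starting guesses.
spend = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(spend, centers=[0, 5])
```

No segment labels are supplied; the two groups emerge only from the distances between values.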
Topic Modeling : Topic modeling is similar to clustering in that it attempts to identify clusters of documents that are similar to each other, but it is more specialized in a text domain where it is also trying to
identify the main themes of those documents.
Association Rule Mining : Also called market basket analysis or frequent itemset mining, this attempts to identify which items tend to occur together more frequently than random chance would indicate, suggesting
an underlying relationship between the items.
Example: In an online web store association rule mining can be used to identify what products tend to be purchased together. This can then be used as input into a product recommender engine to suggest items that may
be of interest to the customer and provide upsell opportunities.
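
The idea of items occurring together "more frequently than random chance" is usually quantified with support, confidence, and lift. The sketch below computes them for one candidate rule; the baskets are invented and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    s_lhs = support(transactions, lhs)
    s_rhs = support(transactions, rhs)
    s_both = support(transactions, set(lhs) | set(rhs))
    confidence = s_both / s_lhs   # how often rhs appears when lhs does
    lift = confidence / s_rhs     # above 1 means more often than chance
    return s_both, confidence, lift


# Invented toy baskets from a web store.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
s, conf, lift = rule_metrics(baskets, {"bread"}, {"butter"})
```

A lift above 1 for bread -> butter is the kind of signal a recommender could use to suggest butter to a customer who has bread in the basket.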
Descriptive Statistics : Descriptive statistics don't provide a model and thus are not considered a learning method, but they can be helpful in providing information to an analyst to understand the underlying data and
can provide valuable insights into the data that may influence choice of data model.
Example: Calculating the distribution of data within each variable of a dataset can help an analyst understand which variables should be treated as categorical variables and which as continuous variables as well as
understanding what sort of distribution the values fall in.
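
A minimal sketch of such a summary, using only the Python standard library, is shown below. The describe helper and the sample column are illustrative assumptions; the count of distinct values is one simple hint for deciding whether to treat a variable as categorical.

```python
import statistics
from collections import Counter


def describe(values):
    """Summarise one numeric column of a dataset."""
    return {
        "n": len(values),
        "distinct": len(set(values)),   # few distinct values hints at categorical
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "most_common": Counter(values).most_common(1)[0],
    }


# Invented sample column.
summary = describe([1, 2, 2, 3, 4])
```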
Validation : Using a model without understanding its accuracy can lead to disastrous consequences. For that reason it is important to understand the error of a model and to evaluate its accuracy
on testing data. Data analysis frequently separates training data from testing data solely to provide a statistically valid assessment of the model and to check
that the model is not over-fitting the training data. N-fold cross validation is also frequently utilized.
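
The training/testing split behind N-fold cross validation can be sketched as follows. This is an illustrative index splitter only; k_fold_indices is a hypothetical name, and in practice the records would normally be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists so each record is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Ten records split into three folds.
folds = list(k_fold_indices(10, 3))
```

The model is fitted on each train list and scored on the matching test list, and the scores are then averaged to estimate error on unseen data.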

Related Questions

Question : You have completed your model and are handing it off to be deployed in production. What should
you deliver to the production team, along with your commented code?

1. The production team needs to understand how your model will interact with the processes they
already support. Give them documentation on expected model inputs and outputs, and guidance
on error-handling.
2. The production team are technical, and they need to understand how the processes that they
support work, so give them the same presentation that you prepared for the analysts.
3. [...]
to understand how your model interacts with the processes they already support. Give them the
same presentation that you prepared for the project sponsor.
4. The production team supports the processes that run the organization, and they need context
to understand how your model interacts with the processes they already support. Give them the
executive summary.

Question : Which method is used to solve for coefficients b0, b1, ..., bn in your linear regression model :
Y = b0 + b1x1 + b2x2 + ... + bnxn

1. Apriori Algorithm
2. Ridge and Lasso
4. Integer programming

Question : You submit a MapReduce job to a Hadoop cluster and notice that although the job was
successfully submitted, it is not completing. What should you do?

1. Ensure that the NameNode is running
2. Ensure that the JobTracker is running
4. Ensure that a DataNode is running

Question : A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality
assurance team samples 1000 disk drives and finds 14 defective units. Which action should the
team recommend?

1. A larger sample size should be taken to determine if the plant is operating correctly
2. The manufacturing process is functioning properly and no further action is required
4. There is a flaw in the quality assurance process and the sample should be repeated

Question : What is a core deliverable at the end of the analytic project?

1. An implemented database design
2. A whitepaper describing the project and the implementation
4. The training materials

Question : Your organization has a website where visitors randomly receive one of two coupons. It is also
possible that visitors to the website will not receive a coupon. You have been asked to determine if
offering a coupon to visitors to your website has any impact on their purchase decision.
Which analysis method should you use?

1. K-means clustering
2. Association rules
4. One-way ANOVA