Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)

Question : Trend, seasonal, and cyclical are components of a time series. What is another component?

1. Irregular
2. Linear
3. Quadratic
4. Exponential

Correct Answer : 1

Explanation:

Question : You are studying the behavior of a population, and you are provided with multidimensional data at
the individual level. You have identified four specific individuals who are valuable to your study,
and would like to find all users who are most similar to each individual. Which algorithm is the
most appropriate for this study?

1. Association rules
2. Decision trees
3. Linear regression
4. K-means clustering

Correct Answer : 4

Explanation: K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well separated as possible. You can control the details of the minimization using several optional input parameters, including ones for the initial values of the cluster centroids and for the maximum number of iterations.
Clustering is primarily an exploratory technique for discovering hidden structure in the data, possibly as a prelude to more focused analysis or decision processes. Specific applications of k-means include image processing, medical applications, and customer segmentation. Clustering is often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
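
The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Lloyd's algorithm, which minimizes the sum of squared Euclidean distances; it is not the implementation of any particular library. The function name `kmeans`, its `init` and `max_iter` parameters, and the sample points are all assumptions made up for this example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Use caller-supplied starting centroids, or pick k random points.
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the sum cannot decrease
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

Passing known records as `init` is one way to seed the clusters with specific individuals of interest, although the centroids move away from those seeds as the algorithm iterates.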

Question : You are using MADlib for Linear Regression analysis. Which value does the statement return?
SELECT (linregr(depvar, indepvar)).r2 FROM zeta1;

1. Coefficients
2. Standard error
3. Goodness of fit
4. P-value

Correct Answer : 3
Explanation: Ordinary least-squares (OLS) linear regression refers to a stochastic model in which the conditional mean of the dependent variable (usually denoted $Y$) is an affine function of the vector of independent variables (usually denoted $x$). That is, $E[Y \mid x] = c^T x$ for some unknown vector of coefficients $c$. The assumption is that the residuals are i.i.d. Gaussian. That is, the (conditional) probability density of $Y$ is given by
$$f(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - c^T x)^2}{2\sigma^2}\right).$$
OLS linear regression finds the vector of coefficients $c$ that maximizes the likelihood of the observations.
Ordinary Least Squares Regression, also called Linear Regression, is a statistical technique for fitting linear models.
It models a linear relationship between a scalar dependent variable and one or more explanatory independent variables, and produces a vector of coefficients.
Training Function : linregr_train(source_table, out_table, dependent_varname, independent_varname, input_group_cols := NULL, heteroskedasticity_option := NULL)
source_table : Text value. The name of the table containing the training data.
out_table : Text value. Name of the generated table containing the output model.
dependent_varname : Text value. Expression to evaluate for the dependent variable.
independent_varname : Text value. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.
input_group_cols : Text value. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is null, no grouping is used and a single result model is generated. Default value: NULL.
heteroskedasticity_option : Boolean value. When True, the heteroskedasticity of the model is also calculated and returned with the results. Default value: False.
Output Table : The output table produced by the linear regression training function contains the following columns.
Any grouping columns provided during training. Present only if the grouping option is used.
coef : Float array. Vector of the coefficients of the regression.
r2 : Float. R-squared coefficient of determination of the model.
std_err : Float array. Vector of the standard error of the coefficients.
t_stats : Float array. Vector of the t-statistics of the coefficients.
p_values : Float array. Vector of the p-values of the coefficients.
condition_no : Float array. The condition number of the matrix. A high condition number usually indicates some numeric instability in the result, yielding a less reliable model. A high condition number often results when there is a significant amount of collinearity in the underlying design matrix, in which case other regression techniques, such as elastic net regression, may be more appropriate.
bp_stats : Float. The Breusch-Pagan statistic of heteroskedasticity. Present only if the heteroskedasticity argument was set to True when the model was trained.
bp_p_value : Float. The Breusch-Pagan calculated p-value. Present only if the heteroskedasticity parameter was set to True when the model was trained.
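
To make the r2 column concrete, the sketch below computes the coefficients and R-squared for a single-predictor regression in plain Python. It is not MADlib code and does not reproduce MADlib's internals; the function name `fit_simple_ols` and the data are hypothetical, chosen only to illustrate why R-squared is read as a goodness-of-fit measure.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; also return R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares estimates for one predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return intercept, slope, r2


# Made-up data that lies close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r2 = fit_simple_ols(x, y)
```

An R-squared near 1, as with this data, means the line accounts for almost all of the variation in the dependent variable; a value near 0 means it accounts for almost none.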

Related Questions

Question : You have been assigned to run a linear regression model for each of , distinct districts, and
all the data is currently stored in a PostgreSQL database. Which tool/library would you use to
produce these models with the least effort?

1. MADlib
2. Mahout
4. HBase

Question : Your customer provided you with , unlabeled records and asked you to separate them into
three groups. What is the correct analytical method to use?

1. Semi Linear Regression
2. Logistic regression
4. Linear regression
5. K-means clustering

Question : You are performing a market basket analysis using the Apriori algorithm. Which measure is a ratio
describing how many more times two items are present together than would be expected if
those two items were statistically independent?

1. Confidence
2. Support
4. Lift
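
The ratio described in this question can be worked through on a small example. The sketch below uses the standard definition, support(A and B) / (support(A) * support(B)), where support is the fraction of transactions containing the item or items. The function name `lift` and the basket data are hypothetical.

```python
def lift(transactions, a, b):
    """Observed co-occurrence of a and b relative to independence."""
    n = len(transactions)
    support_a = sum(a in t for t in transactions) / n
    support_b = sum(b in t for t in transactions) / n
    support_ab = sum(a in t and b in t for t in transactions) / n
    # Under independence, support_ab would equal support_a * support_b.
    return support_ab / (support_a * support_b)


# Four made-up baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
bread_butter_lift = lift(baskets, "bread", "butter")
```

Here bread appears in 3 of 4 baskets, butter in 2 of 4, and both together in 2 of 4, so the ratio is 0.5 / (0.75 * 0.5), about 1.33. A value above 1 means the items occur together more often than independence would predict.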

Question : In which lifecycle stage are appropriate analytical techniques determined?

1. Model planning
2. Model building
4. Discovery

Question : What is Hadoop?

1. Java classes for HDFS types and MapReduce job management and HDFS
2. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm
4. MapReduce paradigm and massive unstructured data storage on commodity hardware

Question : You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient
Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a
pair-wise plot of the clusters, you notice that there is significant overlap between the clusters.
What should you do?

1. Decrease the number of clusters
2. Increase the number of clusters
4. Identify additional measures to add to the analysis