Dell EMC Data Science and BigData Certification Questions and Answers

Question : On analyzing your time series data you suspect that the data represented as
y_1, y_2, y_3, ..., y_{n-1}, y_n
may have a trend component that is quadratic in nature. Which pattern of data will indicate that
the trend in the time series data is quadratic in nature?

1. (y_4 - y_2) - (y_3 - y_1) = ... = (y_n - y_{n-2}) - (y_{n-1} - y_{n-3})

2. ((y_2 - y_1) / y_1) * 100% = ... = ((y_n - y_{n-1}) / y_{n-1}) * 100%

4. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2})

Correct Answer : 4
Explanation: One definition of a time series is that of a collection of quantitative observations that are evenly spaced in time and measured successively. Examples of time series include the continuous
monitoring of a person's heart rate, hourly readings of air temperature, daily closing price of a company stock, monthly rainfall data, and yearly sales figures. Time series analysis is generally used when there are
50 or more data points in a series. If the time series exhibits seasonality, there should be 4 to 5 cycles of observations in order to fit a seasonal model to the data.

Goals of time series analysis:
1. Descriptive: Identify patterns in correlated data, such as trends and seasonal variation
2. Explanation: understanding and modeling the data
3. Forecasting: predicting future values from past patterns
4. Intervention analysis: how does a single event change the time series?
5. Quality control: deviations of a specified size indicate a problem

Time series are analyzed in order to understand the underlying structure and function that produce the observations. Understanding the mechanisms of a time series allows a mathematical model to be developed that
explains the data in such a way that prediction, monitoring, or control can occur. Examples include prediction/forecasting, which is widely used in economics and business. Monitoring of ambient conditions, or of an
input or an output, is common in science and industry. Quality control is used in computer science, communications, and industry.

It is assumed that a time series data set has at least one systematic pattern. The most common patterns are trends and seasonality. Trends are generally linear or quadratic. To find trends, moving averages or
regression analysis is often used. Seasonality is a trend that repeats itself systematically over time. A second assumption is that the data exhibits enough of a random process so that it is hard to identify the
systematic patterns within the data. Time series analysis techniques often apply some type of filter to the data in order to dampen the error. Other potential patterns have to do with lingering effects of earlier
observations or earlier random errors.
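
As a minimal sketch of how a quadratic trend shows up in the data, the example below builds a hypothetical series (the coefficients 2, 3, and 5 are assumptions chosen for illustration) and takes first and second differences with NumPy. For an exactly quadratic series the second differences are constant, and a degree-2 regression recovers the coefficients. Real data would include noise, so the second differences would only be roughly constant.

```python
import numpy as np

# Hypothetical series with a quadratic trend: y_t = 2*t^2 + 3*t + 5
t = np.arange(1, 11)
y = 2 * t**2 + 3 * t + 5

# First differences y_t - y_{t-1}: these grow linearly for a quadratic trend
first_diff = np.diff(y)

# Second differences (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}): constant for a quadratic trend
second_diff = np.diff(y, n=2)

# Regression alternative: a degree-2 least-squares fit recovers the trend coefficients
coeffs = np.polyfit(t, y, deg=2)

print(first_diff)
print(second_diff)
print(coeffs)
```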

There are numerous software programs that will analyze time series, such as SPSS, JMP, and SAS/ETS. For those who want to learn or are comfortable with coding, Matlab, S-PLUS, and R are other software packages that
can perform time series analyses. Excel can be used if linear regression analysis is all that is required (that is, if all you want to find out is the magnitude of the most obvious trend). A word of caution about
using multiple regression techniques with time series data: because time series are autocorrelated, they violate the assumption of independence of errors. Type I error rates will increase
substantially when autocorrelation is present. Also, inherent patterns in the data may dampen or enhance the effect of an intervention; in time series analysis, patterns are accounted for within the analysis.
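
One way to check for the autocorrelation mentioned above before relying on regression output is to compute the sample autocorrelation at a given lag. The sketch below is illustrative only: the function name `sample_autocorr` and both example series are assumptions, not part of the original material. It uses the usual estimator, the lagged cross-products of the mean-centered series divided by the total sum of squares.

```python
import numpy as np

def sample_autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()  # deviations from the series mean
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d * d))

# Hypothetical trending series: neighbouring values are strongly related
trend = np.arange(20, dtype=float)

# Hypothetical alternating series: each value is the opposite of its neighbour
alternating = np.array([1.0, -1.0] * 10)

r_trend = sample_autocorr(trend, 1)
r_alt = sample_autocorr(alternating, 1)

print(r_trend, r_alt)
```

Values near zero at every lag would suggest little serial dependence; large positive or negative values are a sign that the independence assumption is doubtful.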

Observations made over time can be either discrete or continuous. Both types of observations can be equally spaced, unequally spaced, or have missing data. Discrete measurements can be recorded at any time interval,
but are most often taken at evenly spaced intervals. Continuous measurements can be spaced randomly in time, such as measuring earthquakes as they occur because an instrument is constantly recording, or can entail
constant measurement of a natural phenomenon such as air temperature, or a process such as velocity of an airplane.

Question : Which analytical method is considered unsupervised?

1. Naive Bayesian classifier

2. Decision tree
4. K-means clustering

Correct Answer : 4
Explanation: K-means clustering is unsupervised because it groups unlabeled observations without a target variable, whereas the Naive Bayesian classifier and decision trees are trained on labeled examples.
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum
cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans,
including ones for the initial values of the cluster centroids, and for the maximum number of iterations.
Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes. Some specific applications of k-means are image
processing, medical applications, and customer segmentation. Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group
based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
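
The iterative procedure described in the explanation can be sketched as follows. This is a minimal illustration rather than the kmeans routine the explanation refers to. It assumes squared Euclidean distance, the usual k-means objective, where the explanation says only "sum of distances". The function signature, the data, and the choice of initial centroids are also hypothetical. The two inputs `init_centroids` and `max_iter` mirror the optional parameters mentioned in the explanation.

```python
import numpy as np

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the sum cannot decrease further
        labels = new_labels
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Within-cluster sum of squares (wss) for the final assignment
    wss = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, wss

# Hypothetical 2-D data: two well-separated groups
data = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                 [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels, centroids, wss = kmeans(data, init_centroids=data[[0, 3]])

print(labels)
print(centroids)
print(wss)
```

No labels are supplied at any point: the grouping comes only from the distances between observations.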

Question : You have used k-means clustering to classify the behavior of customers for a retail store.
You decide to use household income, age, gender and yearly purchase amount as measures. You
have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What
should you do?

1. Decrease the number of measures used
2. Increase the number of clusters
4. Identify additional measures to add to the analysis

Explanation: kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased
further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including ones for the
initial values of the cluster centroids, and for the maximum number of iterations.
Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes. Some specific applications of k-means are image
processing, medical applications, and customer segmentation. Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group
based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
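
A quick way to spot sparsely populated clusters like the ones in the question is to count the members of each cluster after fitting. The sketch below is illustrative: the helper name `small_clusters`, the `min_size` threshold, and the label vector are all assumptions. What counts as "too small" is a judgment call that depends on the data set.

```python
import numpy as np

def small_clusters(labels, n_clusters, min_size):
    """Return the size of each cluster and the ids of clusters below min_size."""
    sizes = np.bincount(np.asarray(labels), minlength=n_clusters)
    return sizes, np.flatnonzero(sizes < min_size).tolist()

# Hypothetical assignments of 12 customers to 4 clusters
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 0]
sizes, sparse = small_clusters(labels, n_clusters=4, min_size=3)

print(sizes)
print(sparse)
```

Clusters that hold only a handful of observations often suggest that the chosen number of clusters is larger than the data supports, although they can also reflect genuine outliers.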

Related Questions

Question : Refer to the exhibit.
You are using K-means clustering to classify customer behavior for a large retailer. You need to
determine the optimum number of customer groups. You plot the within-sum-of-squares (wss)
data as shown in the exhibit. How many customer groups should you specify?

1. 2
2. 3
4. 8

Question : Refer to the exhibit.
You are given a list of pre-defined association rules:
A) RENTER => BAD CREDIT
B) RENTER => GOOD CREDIT
C) HOME OWNER => BAD CREDIT
D) HOME OWNER => GOOD CREDIT
E) FREE HOUSING => BAD CREDIT
F) FREE HOUSING => GOOD CREDIT
For your next analysis, you must limit your dataset based on rules with confidence greater than
60%.
Which of the rules will be kept in the analysis?

1. Rules B and D
2. Rules A and F
4. Rules D and E

Question : Refer to the exhibit.
You are using k-means clustering to discover groupings within a data set. You plot the within-sum-of-squares
(wss) of multiple cluster sizes. Based on the exhibit, how many clusters should you use in
your analysis?

1. 2
2. 8
4. 10

Question : Refer to the exhibit.
Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the
probability of the classification for the tuple X(0, 0, 1) using a Naive Bayesian classifier?

1. Classification Y = 0, Probability = 1/54
2. Classification Y = 1, Probability = 1/54
4. Classification Y = 0, Probability = 4/54

Question : Refer to the exhibit.
In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset.
What can you conclude from only this exhibit?

1. There is no structure left to model in the data
2. Lag 7 has a significant negative autocorrelation
4. Differencing is required before proceeding with any analysis

Question : Refer to the exhibit.
Which type of data issue would you suspect based on the exhibit?

1. "Saturated" data, indicating potential issues with data definitions
2. Incomplete data, indicating potential issues with data transmission
4. The exhibit does not raise any obvious concerns with the data.