
Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : On analyzing your time series data, you suspect that the data, represented as
y_1, y_2, y_3, ..., y_(n-1), y_n,
may have a trend component that is quadratic in nature. Which pattern in the data would indicate
that the trend in the time series is quadratic?


1. (y_4 - y_2) - (y_3 - y_1) = ... = (y_n - y_(n-2)) - (y_(n-1) - y_(n-3))

2. ((y_2 - y_1) / y_1) * 100% = ... = ((y_n - y_(n-1)) / y_(n-1)) * 100%

3. (y_2 - y_1) = (y_3 - y_2) = ... = (y_n - y_(n-1))

4. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_(n-1)) - (y_(n-1) - y_(n-2))

Correct Answer : 4
Explanation: A quadratic trend implies constant second differences: if y_t = a*t^2 + b*t + c, then (y_(t+1) - y_t) - (y_t - y_(t-1)) = 2a for every t, which is exactly the pattern in option 4. (Option 3, constant first differences, indicates a linear trend; option 2, constant percentage growth, indicates an exponential trend.)
One definition of a time series is that of a collection of quantitative observations that are evenly spaced in time and measured successively. Examples of time series include the continuous monitoring of a person's heart rate, hourly readings of air temperature, daily closing price of a company stock, monthly rainfall data, and yearly sales figures. Time series analysis is generally used when there are 50 or more data points in a series. If the time series exhibits seasonality, there should be 4 to 5 cycles of observations in order to fit a seasonal model to the data.

Goals of time series analysis:
1. Descriptive: Identify patterns in correlated data-trends and seasonal variation
2. Explanation: understanding and modeling the data
3. Forecasting: prediction of short-term trends from previous patterns
4. Intervention analysis: how does a single event change the time series?
5. Quality control: deviations of a specified size indicate a problem

Time series are analyzed in order to understand the underlying structure and function that produce the observations. Understanding the mechanisms of a time series allows a mathematical model to be developed that explains the data in such a way that prediction, monitoring, or control can occur. Examples include prediction/forecasting, which is widely used in economics and business. Monitoring of ambient conditions, or of an input or an output, is common in science and industry. Quality control is used in computer science, communications, and industry.

It is assumed that a time series data set has at least one systematic pattern. The most common patterns are trends and seasonality. Trends are generally linear or quadratic. To find trends, moving averages or regression analysis is often used. Seasonality is a trend that repeats itself systematically over time. A second assumption is that the data exhibits enough of a random process so that it is hard to identify the systematic patterns within the data. Time series analysis techniques often employ some type of filter to the data in order to dampen the error. Other potential patterns have to do with lingering effects of earlier observations or earlier random errors.
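The second-difference pattern from the question above is easy to verify numerically. A minimal sketch, assuming NumPy is available (the coefficients are illustrative): for a purely quadratic trend, first differences grow linearly while second differences are constant.

```python
import numpy as np

# A purely quadratic trend, y_t = a*t^2 + b*t + c (coefficients illustrative).
t = np.arange(10)
y = 2 * t**2 + 3 * t + 1

first_diff = np.diff(y)        # (y_2 - y_1), (y_3 - y_2), ...: grows linearly
second_diff = np.diff(y, n=2)  # (y_3 - y_2) - (y_2 - y_1), ...: constant, equal to 2a

print(second_diff)  # [4 4 4 4 4 4 4 4]
```

For a linear trend the second differences would instead be all zero, and the first differences constant, matching option 3 of the question.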

There are numerous software programs that will analyze time series, such as SPSS, JMP, and SAS/ETS. For those who want to learn or are comfortable with coding, Matlab, S-PLUS, and R are other software packages that can perform time series analyses. Excel can be used if linear regression analysis is all that is required (that is, if all you want to find out is the magnitude of the most obvious trend). A word of caution about using multiple regression techniques with time series data: because of the autocorrelation nature of time series, time series violate the assumption of independence of errors. Type I error rates will increase substantially when autocorrelation is present. Also, inherent patterns in the data may dampen or enhance the effect of an intervention; in time series analysis, patterns are accounted for within the analysis.

Observations made over time can be either discrete or continuous. Both types of observations can be equally spaced, unequally spaced, or have missing data. Discrete measurements can be recorded at any time interval, but are most often taken at evenly spaced intervals. Continuous measurements can be spaced randomly in time, such as measuring earthquakes as they occur because an instrument is constantly recording, or can entail constant measurement of a natural phenomenon such as air temperature, or a process such as velocity of an airplane.






Question : Which analytical method is considered unsupervised?

1. Naive Bayesian classifier
2. Decision tree
3. Linear regression
4. K-means clustering


Correct Answer : 4
Explanation: K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further; the result is a set of clusters that are as compact and well-separated as possible. Implementations typically let you control the minimization through optional parameters such as the initial centroid values and the maximum number of iterations.
Clustering is primarily an exploratory technique for discovering hidden structure in the data, often as a prelude to more focused analysis or decision processes. Specific applications of k-means include image processing, medical applications, and customer segmentation. Clustering is often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to identify customers who have similar behaviors and spending patterns.
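The iterative procedure described above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm, not any particular library's implementation; the function name, the seeded demo data, and the `init` parameter are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0, init=None):
    """Minimal Lloyd's-algorithm sketch: alternate nearest-centroid
    assignment and centroid update until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of 20 points each; seeding one centroid in each
# blob lets the algorithm recover the grouping in a couple of iterations.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2, init=X[[0, 20]])
```

Because each update step can only decrease the total within-cluster distance, the loop is guaranteed to stop; the result, however, depends on the initial centroids, which is why production implementations run multiple random restarts.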




Question : You have used k-means clustering to classify the behavior of customers for a retail store.
You decide to use household income, age, gender, and yearly purchase amount as measures. You
have chosen to use 8 clusters and notice that 2 clusters have only 3 customers assigned. What
should you do?

1. Decrease the number of measures used
2. Increase the number of clusters
3. Decrease the number of clusters
4. Identify additional measures to add to the analysis

Correct Answer : 3

Explanation: When 2 of 8 clusters contain only 3 customers each, the chosen k is almost certainly too large for this data: k-means is fragmenting the natural groupings into clusters too small to be useful for segmentation. Decreasing the number of clusters and re-running the analysis will typically fold these few outlying customers into larger, more interpretable groups. Changing the set of measures (options 1 and 4) alters the feature space without addressing the cluster count, and increasing k (option 2) would only produce more sparsely populated clusters.
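The symptom in the question, near-empty clusters, is straightforward to check programmatically. A sketch assuming NumPy, with hypothetical cluster assignments chosen to mirror the scenario (two clusters of 3 customers):

```python
import numpy as np

# Hypothetical assignments for 103 customers across k = 8 clusters;
# the last two clusters hold only 3 customers each, as in the question.
labels = np.repeat(np.arange(8), [30, 25, 18, 12, 8, 4, 3, 3])

sizes = np.bincount(labels, minlength=8)  # customers per cluster
tiny = np.flatnonzero(sizes <= 3)         # clusters with 3 or fewer members

print(sizes.tolist())  # [30, 25, 18, 12, 8, 4, 3, 3]
print(tiny.tolist())   # [6, 7] -> two near-empty clusters: k is likely too large
```

Running this check after each clustering run, while stepping k down, gives a quick way to find the largest k at which every cluster remains meaningfully populated.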


Related Questions


Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this
data with structured user data residing in their massively parallel database. Which tool should they
use to export the structured data from Hadoop?

1. Sqoop
2. Pig
3. (option not shown in the source)
4. Scribe



Question : When would you prefer a Naive Bayes model to a logistic regression model for classification?

1. When some of the input variables might be correlated
2. When all the input variables are numerical.
3. (option not shown in the source)
4. When you are using several categorical input variables with over 1000 possible values each.



Question : Before you build an ARMA model, how can you tell if your time series is weakly stationary?

1. The mean of the series is close to 0.
2. The series is normally distributed.
3. (option not shown in the source)
4. There appears to be no apparent trend component
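Weak stationarity means the mean, variance, and autocovariance of the series do not change over time. A crude programmatic screen for the first two properties (this is an illustrative heuristic assuming NumPy, not a formal test such as an augmented Dickey-Fuller test):

```python
import numpy as np

def looks_weakly_stationary(y, tol=0.5):
    """Crude screen, not a formal test: compare the mean and standard
    deviation of the first and second halves of the series."""
    arr = np.asarray(y, dtype=float)
    a, b = np.array_split(arr, 2)
    scale = arr.std() or 1.0  # normalize tolerances by overall spread
    return (abs(a.mean() - b.mean()) < tol * scale
            and abs(a.std() - b.std()) < tol * scale)

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 200)             # white noise: no trend
trended = noise + 0.05 * np.arange(200)   # the same noise with a linear trend

print(looks_weakly_stationary(noise))    # True
print(looks_weakly_stationary(trended))  # False: the drifting mean reveals a trend
```

A series with an apparent trend component fails this check because its mean drifts over time, which is why the presence of a trend rules out weak stationarity before fitting an ARMA model.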


Question : What is an example of a null hypothesis?

1. that a newly created model provides a prediction of a null sample mean
2. that a newly created model does not provide better predictions than the currently existing model
3. (option not shown in the source)
4. that a newly created model provides a prediction that will be well fit to the null distribution


Question : You have fit a decision tree classifier. The resulting tree used only some of the input
variables and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the
model is 0.85. What is your evaluation of this model?

1. The tree did not split on all the input variables. You need a larger data set to get a more accurate model.
2. The AUC is high, and the small nodes are all very pure. This is an accurate model.
3. (option not shown in the source)
4. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.



Question : If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?

1. Line chart
2. Bar chart
3. (option not shown in the source)
4. Histogram