Correct Answer : Explanation: In Phase 3, the data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project. During this phase the team refers back to the hypotheses developed in Phase 1, when it first became acquainted with the data and built an understanding of the business problem or domain area. These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to achieve its objectives. Some of the activities to consider in this phase include the following:
- Assess the structure of the datasets. The structure of the datasets is one factor that dictates the tools and analytical techniques for the next phase; analyzing textual data, for example, requires different tools and approaches than analyzing transactional data.
- Ensure that the analytical techniques enable the team to meet the business objectives and to accept or reject the working hypotheses.
- Determine whether the situation warrants a single model or a series of techniques as part of a larger analytic workflow.
A few example models include association rules and logistic regression. Other tools, such as Alpine Miner, enable users to set up a series of steps and analyses and can serve as a front-end user interface (UI) for manipulating Big Data sources in PostgreSQL.
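As a concrete illustration of fitting one of the candidate models named above (logistic regression), here is a minimal sketch using scikit-learn and a synthetic toy dataset; the library choice and data are assumptions for illustration, not part of the original text.

```python
# Hypothetical sketch: fitting one candidate model (logistic regression)
# during model planning/building. scikit-learn and the synthetic dataset
# are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the project's data.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

In practice, the held-out score is one input the team would use in Phase 4 to accept or reject the working hypotheses.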
Question : What is Hadoop?
1. Java classes for HDFS types and MapReduce job management and HDFS
2. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm
3. Access Mostly Uused Products by 50000+ Subscribers
4. MapReduce paradigm and massive unstructured data storage on commodity hardware
Explanation: Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Why use Hadoop?
Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics.
Hadoop enables a computing solution that is:
Scalable: A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of your data.
Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault tolerant: When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
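The MapReduce paradigm that Hadoop implements can be sketched locally in plain Python, with no Hadoop cluster assumed; the map, shuffle, and reduce stages below mirror what Hadoop distributes across many servers.

```python
# Illustrative local sketch of the MapReduce paradigm (word count).
# No Hadoop cluster is involved; this only shows the programming model.
from collections import defaultdict

def mapper(line):
    # Map stage: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce stage: sum all counts emitted for one key (word).
    return word, sum(counts)

def word_count(lines):
    shuffled = defaultdict(list)  # shuffle/sort stage: group values by key
    for line in lines:
        for word, n in mapper(line):
            shuffled[word].append(n)
    return dict(reducer(w, c) for w, c in shuffled.items())

print(word_count(["Hadoop scales out", "Hadoop is fault tolerant"]))
# → {'hadoop': 2, 'scales': 1, 'out': 1, 'is': 1, 'fault': 1, 'tolerant': 1}
```

On a real cluster, the mapper and reducer run on many nodes in parallel, and the framework handles the shuffle, scheduling, and failure recovery described above.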
Question : You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?
Explanation: Introduction to k-Means Clustering. k-means clustering is a partitioning method. The function kmeans partitions data into k mutually exclusive clusters and returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering, k-means clustering operates on actual observations (rather than on the larger set of dissimilarity measures) and creates a single level of clusters. These distinctions mean that k-means clustering is often more suitable than hierarchical clustering for large amounts of data.
kmeans treats each observation in your data as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible. You can choose from five different distance measures, depending on the kind of data you are clustering.
Each cluster in the partition is defined by its member objects and by its centroid, or center. The centroid for each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes cluster centroids differently for each distance measure, to minimize the sum with respect to the measure that you specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster centroids, and for the maximum number of iterations. By default, kmeans uses the k-means++ algorithm for cluster center initialization and the squared Euclidean metric to determine distances.
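The behavior described above can be sketched with scikit-learn's KMeans, an analogue of the MATLAB kmeans function discussed here (the library choice and the synthetic two-blob data are assumptions for illustration); k-means++ initialization and squared Euclidean distances are its defaults as well.

```python
# Sketch of k-means with scikit-learn's KMeans, an analogue of the
# MATLAB kmeans described in the text. Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs, centered near (0, 0) and (3, 3).
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# init="k-means++" and max_iter mirror the optional controls mentioned
# above (initial centroids, maximum number of iterations).
km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300,
            random_state=0).fit(X)
print(km.cluster_centers_)  # centroids of the two clusters
print(km.inertia_)          # sum of squared distances being minimized
```

The `inertia_` attribute is exactly the quantity the iterative algorithm minimizes: the sum of squared distances from each object to its cluster centroid, over all clusters.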
Create Clusters and Determine Separation
The following example explores possible clustering in four-dimensional data by analyzing the results of partitioning the points into three, four, and five clusters.
Note: Because each part of this example generates random numbers sequentially, i.e., without setting a new state, you must perform all steps in sequence to duplicate the results shown. If you perform the steps out of sequence, the answers will be essentially the same, but the intermediate results, the number of iterations, or the ordering of the silhouette plots may differ.
Overlapping clustering (also called alternative clustering or multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
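The comparison of three, four, and five clusters described above can be sketched with silhouette scores in scikit-learn (the library and the synthetic four-dimensional data are assumptions for illustration; the original example uses MATLAB silhouette plots).

```python
# Hedged sketch: comparing k = 3, 4, 5 via silhouette scores on
# synthetic four-dimensional data with three true groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated groups of 40 points each in 4-D.
X = np.vstack([rng.normal(c, 0.5, (40, 4)) for c in (0, 4, 8)])

scores = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

A higher average silhouette indicates more compact, better-separated clusters, so on data like this the k = 3 partition should score best.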
1. Formula A and Formula B are about equally effective at promoting weight gain.
2. Formula A and Formula B are both effective at promoting weight gain.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Either Formula A or Formula B is effective at promoting weight gain.
1. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size.
2. The change in purchase size is small, but may aggregate up to a large increase in profits over the entire customer base.
3. Access Mostly Uused Products by 50000+ Subscribers should run another, larger study.
4. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement.
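The tension between options 1 and 4 above, that a very large sample can make a practically tiny effect statistically significant, can be illustrated with a simulated two-sample t-test; SciPy and all the numbers here are assumptions invented for illustration, not data from the study in the question.

```python
# Hedged illustration: with a huge sample, a practically tiny change in
# purchase size still produces a small p-value. All figures are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
old = rng.normal(100.0, 20.0, 100_000)  # old site: mean purchase ~$100
new = rng.normal(100.5, 20.0, 100_000)  # new site: only ~$0.50 larger

t, p = stats.ttest_ind(new, old)
print(f"effect = ${new.mean() - old.mean():.2f}, p = {p:.2g}")
```

The effect is about fifty cents, yet the p-value is tiny because n is enormous, which is exactly why statistical significance alone does not establish practical importance.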