
Dell EMC Data Science Associate Certification Questions and Answers (Dumps and Practice Questions)



Question : Which method is used to solve for the coefficients b0, b1, ..., bn in your linear regression model:
Y = b0 + b1x1 + b2x2 + ... + bnxn

1. Apriori Algorithm
2. Ridge and Lasso
3. Ordinary Least Squares
4. Integer programming



Correct Answer : 3
Explanation: Y = b0 + b1x1 + b2x2 + ... + bnxn
In the linear model, the bi's are the unknown parameters. The estimates for these unknown parameters are chosen so that, on average, the model provides a reasonable estimate of the outcome; for example, a person's income based on age and education. In other words, the fitted model should minimize the overall error between the linear model and the actual observations. Ordinary Least Squares (OLS) is a common technique for estimating these parameters.
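As an illustration of how the OLS estimates can be computed, here is a minimal sketch in Python with NumPy; the data below are synthetic and invented purely for illustration:

import numpy as np

# Synthetic data: 100 observations, 2 predictors (say, age and education).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Prepend a column of ones so the intercept b0 is estimated along with b1..bn.
X1 = np.column_stack([np.ones(len(X)), X])

# OLS chooses b to minimize ||y - X1 @ b||^2; lstsq returns that solution.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b)  # approximately [3.0, 1.5, -2.0]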





Question : What describes a true limitation of the Logistic Regression method?

1. It does not handle redundant variables well.
2. It does not handle missing values well.
3. It does not handle correlated variables well.
4. It does not have explanatory value.




Correct Answer : 2
Explanation: Logistic regression extends the ideas of linear regression to the situation where the dependent variable, Y, is categorical. We can think of a categorical variable as dividing the observations into classes. For example, if Y denotes a recommendation on holding/selling/buying a stock, we have a categorical variable with three categories, and each stock in the dataset (each observation) belongs to one of three classes: hold, sell, or buy.

Logistic regression can be used to classify a new observation, whose class is unknown, into one of the classes based on the values of its predictor variables (classification). It can also be used on data where the class is known, to find similarities between observations within each class in terms of the predictor variables (profiling).

For example, a logistic regression model can be built to determine whether a person will purchase a new automobile in the next 12 months. The training set could include input variables for a person's age, income, and gender, as well as the age of an existing automobile, together with the outcome variable recording whether the person purchased a new automobile over a 12-month period. The fitted model then provides the likelihood, or probability, of a person making a purchase in the next 12 months.

Logistic regression attempts to predict outcomes based on a set of independent variables, but if researchers include the wrong independent variables, the model will have little to no predictive value. For example, if college admissions decisions depend more on letters of recommendation than on test scores, and researchers don't include a measure for letters of recommendation in their data set, the logit model will not provide useful or accurate predictions. This means that logistic regression is not a useful tool unless researchers have already identified all the relevant independent variables.
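A minimal sketch of the automobile-purchase example using scikit-learn; the feature values and encodings below are invented for illustration and are not part of the original question:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: [age, income ($k), gender (0/1), age of current car].
X = np.array([
    [25,  40, 0,  2],
    [47,  95, 1,  9],
    [35,  60, 0,  6],
    [52, 120, 1, 11],
    [29,  45, 1,  3],
    [41,  80, 0,  8],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = bought a new car within 12 months

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(no purchase), P(purchase)] for each observation.
new_person = np.array([[45, 90, 1, 10]])
print(model.predict_proba(new_person)[0, 1])  # estimated purchase probability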








Question : You submit a MapReduce job to a Hadoop cluster and notice that although the job was
successfully submitted, it is not completing. What should you do?
1. Ensure that the NameNode is running
2. Ensure that the JobTracker is running
3. Ensure that the TaskTracker is running
4. Ensure that a DataNode is running



Correct Answer : 3
Explanation: A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Every TaskTracker is configured with a set of slots; these indicate the number of tasks it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and failing that, it looks for an empty slot on a machine in the same rack.

The TaskTracker spawns a separate JVM process for each task; this ensures that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing their output and exit codes. When a process finishes, successfully or not, the TaskTracker notifies the JobTracker. TaskTrackers also send periodic heartbeat messages to the JobTracker to reassure it that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack. The workflow is:

1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
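As a concrete illustration of the Map and Reduce tasks that TaskTrackers execute, here is a minimal word-count pair written for Hadoop Streaming. This sketch is not from the original text; the script names (mapper.py, reducer.py) are illustrative.

# mapper.py -- the Map task: emit (word, 1) for every word read from stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- the Reduce task: sum the counts for each word.
# Hadoop Streaming delivers the mapper output to the reducer sorted by key.
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))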



Related Questions


Question : You have run the association rules algorithm on your data set, and the two rules {banana, apple}
=> {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?


1. {grape, apple, orange} must be a frequent itemset.
2. {banana, apple, grape, orange} must be a frequent itemset.
3. {grape} => {banana, apple} must be a relevant rule.
4. {banana, apple} => {orange} must be a relevant rule.



There are 600,000 transactions in total.
7,500 transactions contain diapers (1.25 percent)
60,000 transactions contain beer (10 percent)
6,000 transactions contain both diapers and beer (1.0 percent)
If there were no association between beer and diapers (i.e., if they were statistically independent), we would expect only 10% of diaper purchasers to also buy beer, since 10% of all customers buy beer. Instead, we discover that 80% (= 6,000/7,500) of diaper purchasers also buy beer. This is a factor of 8 increase over what was expected. That factor is called Lift: the ratio of the observed frequency of co-occurrence to the expected frequency, determined simply by counting the transactions in the database. So in this case, the association rule would state that diaper purchasers will also buy beer, with a Lift factor of 8.

In statistics, Lift is estimated by the ratio of the joint probability of two items x and y to the product of their individual probabilities: Lift = P(x,y) / [P(x)P(y)]. If the two items are statistically independent, then P(x,y) = P(x)P(y), corresponding to Lift = 1. Note that anti-correlation yields Lift values less than 1, which is also an interesting discovery, corresponding to mutually exclusive items that rarely co-occur.
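The same arithmetic as a short Python sketch, using the transaction counts above:

# Transaction counts from the example above.
total   = 600_000
diapers =   7_500
beer    =  60_000
both    =   6_000

p_x  = diapers / total   # P(diapers) = 0.0125
p_y  = beer / total      # P(beer)    = 0.10
p_xy = both / total      # P(diapers and beer) = 0.01

lift = p_xy / (p_x * p_y)
print(lift)  # 8.0: diaper buyers purchase beer 8x more often than expected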



Question : When would you use a Wilcoxon Rank Sum test?
1. When the data can easily be sorted
2. When the populations represent the sums of other values
3. When the data cannot easily be sorted
4. When you cannot make an assumption about the distribution of the populations


Question : In the MapReduce framework, what is the purpose of the Reduce function?

1. It writes the output of the Map function to storage
2. It breaks the input into smaller components and distributes to other nodes in the cluster
3. It aggregates the results of the Map function and generates processed output
4. It distributes the input to multiple nodes for processing



Question : Which of the following is an example of quasi-structured data?
1. OLAP
2. Customer record table
3. Clickstream data
4. OLTP




Question : A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse
contains data collected from many sources and transformed through a complex, multi-stage ETL
process. What is a concern the data scientist should have about the data?


1. It is too processed
2. It is not structured
3. It is not normalized
4. It is too centralized






Question : Which word or phrase completes the statement? Emphasis color is to standard color as _______ .


1. Main message is to key findings
2. Frequent item set is to item
3. Main message is to context
4. Pie chart is to proportions




Question : Which data asset is an example of semi-structured data?

1. XML data file
2. Database table
3. Webserver log
4. News article