Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously worked extensively with SQL and databases. Which query interface would you recommend?
Correct Answer : Hive Explanation: People often ask why Pig and Hive both exist when they seem to do much of the same thing. Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are accustomed to using SQL for querying data. Pig fits in through its data-flow strengths: it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form needed for querying. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; be aware that HQL is limited in the commands it understands, but it is still quite useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
If you have a SQL or relational database background, this section will look very familiar. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command-line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user's client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
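As a rough sketch of the client access described above (assuming the third-party PyHive package, a HiveServer2 service on its default port 10000, and a placeholder host name, database, and weblogs table that are not part of the question), an HQL query can be submitted from Python like this:

```python
from pyhive import hive   # third-party client that talks to HiveServer2 over Thrift

# Connection details are placeholders; adjust host, port, and database for your cluster.
conn = hive.connect(host="hadoop-master.example.com", port=10000, database="default")
cursor = conn.cursor()

# HQL reads like ordinary SQL; behind the scenes Hive compiles it into
# MapReduce jobs, so expect batch-style latency rather than OLTP speed.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM   weblogs
    GROUP  BY page
    ORDER  BY hits DESC
    LIMIT  10
""")

for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()
```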
Hive queries look very much like traditional SQL against a database. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. One is that Hadoop is intended for long sequential scans, and because Hive is built on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Another is that Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.
Question : In linear regression, what indicates that an estimated coefficient is significantly different than zero?
Correct Answer : 4 Explanation: The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response. Significance of the estimated coefficients: are the t-statistics greater than 2 in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try to refit the model with the least significant variable excluded, which is the "backward stepwise" approach to model refinement.
Remember that the t-statistic is just the estimated coefficient divided by its own standard error. Thus, it measures "how many standard deviations from zero" the estimated coefficient is, and it is used to test the null hypothesis that the true value of the coefficient is zero; rejecting that hypothesis is what confirms that the independent variable really belongs in the model.
The p-value is the probability of observing a t-statistic that large or larger in magnitude, given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05 (which occurs roughly when the t-statistic is less than 2 in absolute value), the coefficient may be only "accidentally" significant.
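To make those two definitions concrete, here is a minimal Python sketch (using SciPy) with made-up numbers for the coefficient, its standard error, and the residual degrees of freedom:

```python
from scipy import stats

# Hypothetical values for illustration only.
coef, std_err, dof = 1.8, 0.7, 25

t_stat = coef / std_err                      # "how many standard errors from zero"
p_value = 2 * stats.t.sf(abs(t_stat), dof)   # two-sided p-value under H0: coefficient = 0

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # t is about 2.57, p roughly 0.017 (< 0.05)
```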
There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value greater than 0.05 can be dropped from the model without affecting the error measures very much; try it and see.
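The following is a minimal sketch of that backward-stepwise check, assuming the statsmodels package and a small synthetic data set (the variables are invented for illustration; the third column of X is pure noise, so its p-value should come out large):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends on the first two columns of X; the third is noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

X_design = sm.add_constant(X)            # add the intercept column
model = sm.OLS(y, X_design).fit()
print(model.pvalues)                     # one p-value per coefficient (const, x1, x2, x3)

# Backward step: drop the least significant predictor if its p-value exceeds 0.05,
# refit, and compare the error measures of the two fits.
worst = model.pvalues[1:].argmax() + 1   # skip the intercept
if model.pvalues[worst] > 0.05:
    reduced = np.delete(X_design, worst, axis=1)
    print(sm.OLS(y, reduced).fit().summary())
```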
Question : Which graphical representation shows the distribution and multiple summary statistics of a continuous variable for each value of a corresponding discrete variable?
Correct Answer : The box-and-whisker plot (box plot) Explanation: Statistics assumes that your data points (the numbers in your list) are clustered around some central value. The "box" in the box-and-whisker plot contains, and thereby highlights, the middle half of these data points.
To create a box-and-whisker plot, you start by ordering your data (putting the values in numerical order), if they aren't ordered already. Then you find the median of your data. The median divides the data into two halves. To divide the data into quarters, you then find the medians of these two halves. Note: If you have an even number of values, so the first median was the average of the two middle values, then you include the middle values in your sub-median computations. If you have an odd number of values, so the first median was an actual data point, then you do not include that value in your sub-median computations. That is, to find the sub-medians, you're only looking at the values that haven't yet been used.
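A short Python sketch of that sub-median procedure (the sample lists and the quartiles helper are illustrative only):

```python
import statistics

def quartiles(values):
    # Q1, Q2 (median), Q3 using the rule described above: with an odd number
    # of values the overall median is left out of the two halves; with an even
    # number the two middle values stay in their respective halves.
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

print(quartiles([1, 3, 5, 7, 9, 11, 13]))   # (3, 7, 11)
print(quartiles([1, 2, 3, 4, 5, 6]))        # (2, 3.5, 5)
```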
You have three points: the first middle point (the median), and the middle points of the two halves (what I call the "sub-medians"). These three points divide the entire data set into quarters, called "quartiles". The top point of each quartile has a name, being a "Q" followed by the number of the quarter. So the top point of the first quarter of the data points is "Q1", and so forth. Note that Q1 is also the middle number for the first half of the list, Q2 is also the middle number for the whole list, Q3 is the middle number for the second half of the list, and Q4 is the largest value in the list.
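Tying this back to the original question, here is a hedged matplotlib sketch that draws one box (Q1 to Q3, with the median line, whiskers, and outliers) per value of a discrete variable; the category names and the synthetic data are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: a continuous measurement grouped by a discrete variable
# with three hypothetical categories A, B, and C.
rng = np.random.default_rng(1)
groups = {
    "A": rng.normal(loc=10, scale=2, size=100),
    "B": rng.normal(loc=12, scale=3, size=100),
    "C": rng.normal(loc=9,  scale=1, size=100),
}

fig, ax = plt.subplots()
# One box per category: the box spans Q1-Q3, the inner line is the median,
# and the whiskers/fliers summarize the rest of the distribution.
ax.boxplot(list(groups.values()), labels=list(groups.keys()))
ax.set_xlabel("discrete variable")
ax.set_ylabel("continuous variable")
plt.show()
```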
1. Interpolate a daily model for revenue from the monthly revenue data.
2. Aggregate all data to the monthly level in order to create a monthly revenue model.
4. Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.