Explanation: Types of quasi-structured data and examples of each
- Totally unstructured data - Google search results cover all websites, but are hard to further categorize without access to the Google database itself.
- Intuitive structure - my wordtree algorithm accepts any pasted text and yields a network map based on similarity of language within the text, as well as proximity of words to each other within the text. But it is not "tagged" the way YouTube and Flickr track content in images.
- Emergent structure - algorithms extract the main idea of groups of stories.
- Pseudo-structuring - looking at content and assigning structure to all possible variations of a single document type, such as I did with the auditing tool.
- Guess, apply a rule, and refine - in this mode the algorithm tries an approach and refines it iteratively based on user feedback. If the feedback is automated in the form of a score on the result, this approach becomes evolutionary programming.
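The last mode above can be sketched as a tiny loop: guess a variation of the current rule, apply it, and keep it only if an automated score improves. Everything here is hypothetical - the "rule" is a single made-up similarity threshold and score() stands in for whatever automated feedback the real system would produce.

```python
import random

def score(threshold):
    # Hypothetical automated feedback: pretend the structuring result
    # is best when the threshold is 0.6.
    return 1.0 - abs(threshold - 0.6)

def refine(initial=0.5, steps=200, seed=42):
    rng = random.Random(seed)
    best = initial
    best_score = score(best)
    for _ in range(steps):
        candidate = best + rng.uniform(-0.05, 0.05)  # guess a variation
        s = score(candidate)                         # apply and score it
        if s > best_score:                           # refine: keep improvements
            best, best_score = candidate, s
    return best

print(round(refine(), 2))
```

With a population of rules instead of a single one, plus crossover between them, this same loop becomes the evolutionary programming mentioned above.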
These strategies for structuring Big Data have come about as a consequence of two trends. First - 100 times more content is added online each year than the sum of all books ever written in history. Second - most of this content is structured by institutions that for various reasons don't want to release the fully annotated version of the information. So pragmatic programmers like me build "wrappers" to restructure the parts that are available. Eventually there will be a universal wrapper for all content about financial records, and another one for all organization reports. These data sets will organize content into clusters that are similar enough for us to study patterns on a global scale. That's when "big data" begins to get interesting. Today, we're in the early stages of deconstructing the structure so that we can reconstruct larger data sets from the individual parts that each have unique yet "incompatible" structures. It is like taking apart all the cars in a junkyard so we can categorize all the parts and deliver them to customers who want to build fresh cars. You see cars go in and cars go out, but a lot happens in between.
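The "wrapper" idea above can be made concrete with a minimal sketch: two sources describe the same kind of record under incompatible field names, and a thin adapter maps both into one shared structure. All field names and values here are invented for illustration.

```python
# Hypothetical adapters ("wrappers") that normalize two incompatible
# record layouts into one shared schema: {"name": ..., "revenue": ...}.

def wrap_source_a(record):
    return {"name": record["org_name"], "revenue": record["rev_usd"]}

def wrap_source_b(record):
    return {"name": record["company"], "revenue": record["annual_revenue"]}

a = {"org_name": "Acme", "rev_usd": 1200}
b = {"company": "Globex", "annual_revenue": 3400}

# Once wrapped, records from both sources can live in one data set.
combined = [wrap_source_a(a), wrap_source_b(b)]
print(combined)
```

This is the "third system" pattern described below: neither source was built to work with the other, but the adapter lets their data share one structure.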
Last year, if someone had asked you to track all the work you do on your computer, you would have probably filled out a survey (like the "time tracking" reports I fill out monthly at work). In the future your computer will fill them out for you and in greater detail, and these data will be "mashable" with other reporting systems. This will not happen because two systems are built to work together, but instead because someone builds a third system that allows the two systems to share information. Eventually we will build "genetic algorithms" that will write programs that can re-organize data into usable structures regardless of how the original data was structured. This is going to happen in the next ten years and we will ask ourselves why we didn't do it sooner.
Question : What would be considered "Big Data"?
1. An OLAP Cube containing customer demographic information about 100,000,000 customers
2. Aggregated statistical data stored in a relational database table
Explanation: Information sets that approach the size of all information known about "X". For example, instead of a sample of e-books, it means a comprehensive set of all e-books ever written (~70% to N=ALL). Big Data sets are noisier, yet they do not require us to know beforehand what questions we will pose of them; we can drill down into a Big Data set and ask arbitrary questions. This complements classical statistics, which relies on random sampling to eliminate bias. Big Data instead assumes bias is present and quantifies the biases in the data set, so that they can be detected, inspected, and corrected.
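The contrast drawn above can be shown in a few lines: when we hold the complete (N=ALL) data set, the bias of any skewed subset can be measured directly against the full population rather than avoided up front by random sampling. The numbers below are made up purely for illustration.

```python
# A complete "population" of values 1..100 (mean 50.5) and a skewed
# subset that systematically excludes the low end.
population = list(range(1, 101))
biased_sample = [x for x in population if x > 40]

pop_mean = sum(population) / len(population)
sample_mean = sum(biased_sample) / len(biased_sample)

# Because we hold the full data set, the bias is directly quantifiable:
bias = sample_mean - pop_mean
print(pop_mean, sample_mean, bias)
```

Knowing the bias is exactly +20 lets us detect, inspect, and correct for it, which is the workflow the explanation describes.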
Question : A data scientist plans to classify the sentiment polarity of product reviews collected from the Internet. What is the most appropriate model to use? Suppose labeled training data is available.
Explanation: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.[6]
An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification.
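The explanation above can be grounded in a hand-rolled multinomial naive Bayes sentiment classifier. It shows the two ideas directly: maximum-likelihood parameter estimation from word counts, and per-word log-likelihoods summed as if the words were independent given the class. The four training reviews are invented; a real application would use a labeled review corpus.

```python
import math
from collections import Counter

# Tiny invented training set of labeled reviews.
train = [
    ("great product works well", "pos"),
    ("love it great value", "pos"),
    ("terrible waste of money", "neg"),
    ("broke fast terrible quality", "neg"),
]

# Maximum-likelihood parameter estimation: count words per class.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    scores = {}
    for label in ("pos", "neg"):
        # Log prior plus a sum of log likelihoods: the naive independence
        # assumption lets us treat each word's contribution separately.
        log_prob = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace smoothing so an unseen word doesn't zero the product.
            count = word_counts[label].get(word, 0) + 1
            log_prob += math.log(count / (total + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

print(classify("great quality"))
```

Note how little data the model needs: four labeled sentences already yield usable parameters, which is exactly the advantage stated above.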
1. Select one of the four datasets and begin planning and building a model
2. Combine the data from all four of the datasets and begin planning and building a model
3. Access Mostly Uused Products by 50000+ Subscribers
4. Visualize the data to further explore the characteristics of each data set
1. Run all the models again against a larger sample, leveraging more historical data.
2. Report that the results are insignificant, and reevaluate the original business question.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Modify samples used by the models and iterate until a significant result occurs.