Question : Which data asset is an example of quasi-structured data? 1. Web server log 2. XML data file 4. News article Ans : 1 Exp : Types of quasi-structured data and examples of each
Totally unstructured data - Google search results cover all websites, but are hard to categorize further without access to the Google database itself.
Intuitive structure - my wordtree algorithm accepts any pasted text and yields a network map based on similarity of language within the text, as well as the proximity of words to each other within the text. But it is not "tagged" the way YouTube and Flickr track content in images.
Emergent structure - algorithms to extract the main idea of groups of stories.
Pseudo-structuring - looking at content and assigning structure to all possible variations of a single document type, such as I did with the auditing tool.
Guess, apply a rule, and refine - in this mode the algorithm tries an approach and refines it iteratively based on user feedback. If the feedback is automated in the form of a score on the result, this approach becomes evolutionary programming.
(I am still figuring out how to describe this, so some of the examples above may be the same thing.)
These strategies for structuring Big Data have come about as a consequence of two trends. First - 100 times more content is added online each year than the sum of all books ever written in history. Second - most of this content is structured by institutions that for various reasons don't want to release the fully annotated version of the information. So pragmatic programmers like me build "wrappers" to restructure the parts that are available. Eventually there will be a universal wrapper for all content about financial records, and another one for all organization reports. These data sets will organize content into clusters that are similar enough for us to study patterns on a global scale. That's when "big data" begins to get interesting. Today, we're in the early stages of deconstructing the structure so that we can reconstruct larger data sets from the individual parts that each have unique yet "incompatible" structures. It is like taking apart all the cars in a junk yard so we can categorize all the parts and deliver them to customers that want to build fresh cars. You see cars go in and cars go out, but a lot happens in between.
Last year, if someone had asked you to track all the work you do on your computer, you would probably have filled out a survey (like the "time tracking" reports I fill out monthly at work). In the future your computer will fill them out for you and in greater detail, and these data will be "mashable" with other reporting systems. This will not happen because two systems are built to work together, but instead because someone builds a third system that allows two systems to share information. Eventually we will build "genetic algorithms" that will write programs that can re-organize data into usable structures regardless of how the original data was structured. This is going to happen in the next ten years and we will ask ourselves why we didn't do it sooner.
Question : What would be considered "Big Data"?
1. An OLAP Cube containing customer demographic information about 100,000,000 customers
2. Daily log files from a web server that receives 100,000 hits per minute
4. Spreadsheets containing monthly sales data for a Global 100 corporation
Ans : 2 Exp :
Question : A data scientist plans to classify the sentiment polarity of , product reviews collected from the Internet. What is the most appropriate model to use? Suppose labeled training data is available. 1. Naive Bayesian classifier 2. Linear regression 4. K-means clustering Ans : 1 Exp :
Question : Your company has different sales teams. Each team's sales manager has developed incentive offers to increase the size of each sales transaction. Any sales manager whose incentive program can be shown to increase the size of the average sales transaction will receive a bonus. Data are available for the number and average sale amount for transactions offering one of the incentives as well as transactions offering no incentive. The VP of Sales has asked you to determine analytically if any of the incentive programs has resulted in a demonstrable increase in the average sale amount. Which analytical technique would be appropriate in this situation? 1. One-way ANOVA
Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming. Which query interface would you recommend? 1. Pig 2. Hive 4. HBase
Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in a production single-instance JDBC database. They collaborate with the production team to import the data into Hadoop. Which tool should they use? 1. Sqoop 2. Pig 4. Scribe
Ans : 1
Explanation:
Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop?
Question : When would you prefer a Naive Bayes model to a logistic regression model for classification?
1. When you are using several categorical input variables with over 1000 possible values each. 2. When you need to estimate the probability of an outcome, not just which class it is in. 4. When some of the input variables might be correlated. Ans : 1 Exp :
Question : You have fit a decision tree classifier using input variables. The resulting tree used of the variables, and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the model is 0.85. What is your evaluation of this model? 1. The tree is probably overfit. Try fitting shallower trees and using an ensemble method. 2. The AUC is high, and the small nodes are all very pure. This is an accurate model. 3. Access Mostly Uused Products by 50000+ Subscribers accurate model 4. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.
Ans : 1 Exp :
Question : If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?
Question : You are testing two new weight-gain formulas for puppies. The test gives the results: Control group: 1% weight gain; Formula A: 3% weight gain; Formula B: 4% weight gain. A one-way ANOVA returns a p-value = 0.027. What can you conclude? 1. Either Formula A or Formula B is effective at promoting weight gain. 2. Formula B is more effective at promoting weight gain than Formula A. 4. Formula A and Formula B are about equally effective at promoting weight gain.
Ans : 1 Exp :
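A significant one-way ANOVA only indicates that at least one group mean differs from the others; on its own it does not say which one. As a rough illustration, the F statistic behind such a test can be sketched as follows. The weight-gain samples are invented for this example and are not taken from the question, and the function name `one_way_anova_f` is hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand = mean(x for g in groups for x in g)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical weight-gain percentages, invented for illustration only.
control = [0.5, 1.0, 1.5]
formula_a = [2.5, 3.0, 3.5]
formula_b = [3.5, 4.0, 4.5]
f_stat = one_way_anova_f([control, formula_a, formula_b])
```

A large F value says the group means are spread out relative to the noise within groups. Deciding which formula differs from the control would need a follow-up comparison between specific pairs of groups.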
Question : Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?
Question : The average purchase size from your online sales site is $, . The customer experience team believes a certain adjustment of the website will increase sales. A pilot study on a few hundred customers showed an increase in average purchase size of $1.47, with a significance level of p=0.1. The team runs a larger study, of a few thousand customers. The second study shows an increased average purchase size of $0.74, with a significance level of 0.03. What is your assessment of this study? 1. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size. 2. The change in purchase size is small, but may aggregate up to a large increase in profits over the entire customer base. 3. Access Mostly Uused Products by 50000+ Subscribers should run another, larger study. 4. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement. Ans : 1 Exp :
Question : Which word or phrase completes the statement? Business Intelligence is to monitoring trends as Data Science is to ________ trends. 1. Predicting 2. Discarding 4. Optimizing
Ans : 1 Exp :
Question : Consider a scale that has five () values that range from "not important" to "very important". Which data classification best describes this data?
Question : You have used k-means clustering to classify behavior of , customers for a retail store. You decide to use household income, age, gender and yearly purchase amount as measures. You have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What should you do? 1. Decrease the number of clusters 2. Increase the number of clusters 4. Identify additional measures to add to the analysis
Ans : 1 Exp :
Question For which class of problem is MapReduce most suitable?
Question You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?
Question : Refer to exhibit. You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables 2. Only three variables (A, B, and C) have significant correlation with sales. You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R2 of the model without artificially inflating it? 1. Create clusters based on the data and use them as model inputs 2. Force all 15 variables into the model as independent variables 4. Break variables A, B, and C into their own univariate models
Ans : 1 Exp :
Question : You are given , , user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed? 1. Run MapReduce to transform the data, and find relevant key value pairs. 2. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively. 4. Partition the data by XML file size, and run K-means clustering in each partition.
Ans : 1 Exp :
Question : A call center for a large electronics company handles an average of , support calls a day. The head of the call center would like to optimize the staffing of the call center during the rollout of a new product due to recent customer complaints of long wait times. You have been asked to create a model to optimize call center costs and customer wait times. The goals for this project include: 1. Relative to the release of a product, how does the call volume change over time? 2. How to best optimize staffing based on the call volume for the newly released product, relative to old products. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Determine the frequency of calls by both product type and customer language. Which goals are suitable to be completed with MapReduce? 1. 2,4 2. 1,3 3. Access Mostly Uused Products by 50000+ Subscribers 4. 2,3,4
Ans : 1 Exp :
Question : You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study? 1. K-means clustering 2. Linear regression 4. Decision trees
Ans : 1 Exp :
Question A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend?
1. The manufacturing process should be inspected for problems. 2. A larger sample size should be taken to determine if the plant is functioning properly 4. The manufacturing process is functioning properly and no further action is required. Ans : 1 Exp :
Question : A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project? 1. Linear regression 2. K-means clustering 4. Logistic regression
Ans : 4 Exp :
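Logistic regression fits this kind of problem because it passes a linear score through the logistic function, so the output always lies between 0 and 1 and can be read as a probability. A minimal sketch is shown below. The coefficients, intercept, and patient values are made up for illustration and do not come from any fitted model, and `predict_probability` is a hypothetical helper name.

```python
import math

def predict_probability(age, is_male, cholesterol, coef, intercept):
    """Map a linear combination of risk factors to a probability in (0, 1)."""
    score = intercept + coef[0] * age + coef[1] * is_male + coef[2] * cholesterol
    # Logistic (sigmoid) function: squashes any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, invented for illustration only.
coef = (0.05, 0.5, 0.01)
intercept = -8.0
p = predict_probability(age=60, is_male=1, cholesterol=240,
                        coef=coef, intercept=intercept)
```

A linear regression on the same inputs could return values below 0 or above 1, which is why it is a poor fit when the target is a probability.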
Question : What are the characteristics of Big Data?
1. Data volume, business importance, and data structure variety. 2. Data volume, processing complexity, and data structure variety. 4. Data volume, processing complexity, and business importance. Ans : 2 Exp :
Question You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical method would you recommend?
Question : What is the format of the output from the Map function of MapReduce? 1. Key-value pairs 2. Binary representation of keys concatenated with structured data 4. Unique key record and separate records of all possible values Ans : 1
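The key-value contract can be sketched with a small in-memory word count. This is plain Python that imitates the map, shuffle, and reduce phases; it is not the Hadoop API, and the function names `map_fn`, `reduce_fn`, and `run` as well as the sample log lines are assumptions made for this example.

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit one (key, value) pair per token in the input line."""
    return [(token.lower(), 1) for token in line.split()]

def reduce_fn(key, values):
    """Reduce step: combine all values that were emitted for the same key."""
    return key, sum(values)

def run(lines):
    # Shuffle: group the mapped values by key before handing them to reduce.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())

# Hypothetical access-log fragments, invented for illustration only.
counts = run(["GET /index.html", "GET /about.html", "POST /index.html"])
```

The map step knows nothing about totals; it only emits pairs. Aggregation happens in the reduce step, once every value for a given key has been grouped together.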
Question : Which data type value is used for the observed response variable in a logistic regression model? 1. Any positive real number 2. Any integer 3. Access Mostly Uused Products by 50000+ Subscribers 4. Any real number Ans : 3 Exp :
Question : In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables? 1. Apply a transformation to a variable 2. Use a different statistical package 4. Change the units of measurement on the independent variable
Ans : 1 Exp :
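As a small illustration of why a transformation can help, the sketch below builds a response that grows exponentially with the input and compares the Pearson correlation before and after taking the logarithm of the response. The data are synthetic and the growth rate of 0.8 is an arbitrary assumption; `pearson` is a hand-written helper, not a library call.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5, 6]
y = [math.exp(0.8 * v) for v in x]            # synthetic exponential response
r_raw = pearson(x, y)                         # curved relationship
r_log = pearson(x, [math.log(v) for v in y])  # straight line after the log transform
```

Changing the units of measurement only rescales a variable, so it leaves the correlation unchanged; a nonlinear transformation such as the logarithm is what changes the shape of the relationship.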
Question What is the primary bottleneck in text classification? 1. The ability to parse unstructured text data. 2. The high dimensionality of text data. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The fact that text corpora are dynamic.
Ans : 3 Exp :
Question : Your customer provided you with , unlabeled records and asked you to separate them into three groups. What is the correct analytical method to use?
Question : How does Pig's use of a schema differ from that of a traditional RDBMS? 1. Pig's schema is optional 2. Pig's schema requires that the data is physically present when the schema is defined 4. Pig's schema supports a single data type
Ans : 1 Exp :
Question : You are asked to create a model to predict the total number of monthly subscribers for a specific magazine. You are provided with 1 year's worth of subscription and payment data, user demographic data, and 10 years' worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building a predictive model for subscribers? 1. Logistic regression 2. Decision trees 4. Linear regression
Ans : 4 Exp :
Question : What describes a true property of the Logistic Regression method? 1. It is robust with redundant variables and correlated variables. 2. It handles missing values well. 4. It works well with variables that affect the outcome in a discontinuous way.
Ans : 1 Exp :
Question : A data scientist is asked to implement an article recommendation feature for an on-line magazine. The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics. Which method should the data scientist try first? 1. Association Rules 2. Naive Bayesian 4. K Means Clustering
Ans : 4 Exp :
Question : While having a discussion with your colleague, this person mentions that they want to perform Kmeans clustering on text file data stored in HDFS. Which tool would you recommend to this colleague? 1. Sqoop 2. HBase 3. Access Mostly Uused Products by 50000+ Subscribers 4. Scribe
Ans : 3 Exp :
Question : What describes a true limitation of the Logistic Regression method? 1. It does not handle redundant variables well. 2. It does not handle missing values well. 4. It does not have explanatory values.
Ans : 2 Exp :
Question : You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing. What should you do? 1. Ensure that the JobTracker is running 2. Ensure that the TaskTracker is running. 4. Ensure that a DataNode is running
Ans : 2 Exp :
Question : A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend? 1. A smaller sample size should be taken to determine if the plant is operating correctly 2. A larger sample size should be taken to determine if the plant is operating correctly 3. Access Mostly Uused Products by 50000+ Subscribers 4. There is a flaw in the quality assurance process and the sample should be repeated
Ans : 3 Exp :
Question : Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use?
Question : What describes the use of the UNION clause in a SQL statement?
1. Operates on queries and potentially increases the number of rows 2. Operates on queries and potentially decreases the number of rows 4. Operates on both tables and queries and potentially increases both the number of rows and columns
Ans : 1 Exp :
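A quick way to see this is to combine two queries in an in-memory SQLite database using Python's built-in `sqlite3` module. The table names `online_sales` and `store_sales` and their contents are hypothetical, chosen only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_sales (customer TEXT);
    CREATE TABLE store_sales (customer TEXT);
    INSERT INTO online_sales VALUES ('ann'), ('bob');
    INSERT INTO store_sales VALUES ('bob'), ('cy');
""")

# UNION stacks the rows of two queries and drops duplicate rows.
union_rows = conn.execute(
    "SELECT customer FROM online_sales UNION SELECT customer FROM store_sales"
).fetchall()

# UNION ALL stacks the rows and keeps duplicates.
union_all_rows = conn.execute(
    "SELECT customer FROM online_sales UNION ALL SELECT customer FROM store_sales"
).fetchall()
conn.close()
```

Each input query returns two rows, while the combined result has three rows with UNION and four with UNION ALL. The number of columns stays the same in both cases, since both queries must return the same column layout.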
Question : In the MapReduce framework, what is the purpose of the Reduce function? 1. It distributes the input to multiple nodes for processing 2. It writes the output of the Map function to storage 4. It breaks the input into smaller components and distributes to other nodes in the cluster
Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously worked extensively with SQL and databases. Which query interface would you recommend? 1. HBase 2. Crunch 4. Hive
totally unstructured data - google search results cover all websites, but are hard to further categorize without access the google database itself intuitive-structure - my wordtree algorithm accepts any pasted text and yields a network map based on similarity of langauge within the text, as well as proximity of words to each other within the text. But it is not "tagged" the way youtube and flickr track content in images emergent structure - algorithms to extract the main idea of groups of stories pseudo-structuring - looking at content and assigning structure to all possible variations of a single document type, such as I did with the auditing tool. guess, apply a rule, and refine - in this mode the algorithm tries an approach and refines it iteratively based on user feedback. IF the feedback is automated in the form of a score on the result, this approach becomes evolutionary programming. (I am still figuring out how to describe this - so some of these above examples may be the same thing.)
These strategies for structuring Big Data have come about as a consequence of two trends. First - 100 times more content is added online each year than the sum of all books ever written in history. Second - most of this content is structured by institutions that for various reasons don't want to release the fully annotated version of the information. So pragmatic programmers like me build "wrappers" to restructure the parts that are available. Eventually there will be a universal wrapper for all content about financial records, and another one for all organization reports. These data sets will organize content into clusters that are similar enough for us to study patterns on a global scale. That's when "big data" begins to get interesting. Today, we're in the early stages of deconstructing the structure so that we can reconstruct larger data sets from the individual parts that each have unique yet "incompatible" structures. It is like taking apart all the cars in a junk yard so we can categorize all the parts and deliver them to customers that want to build fresh cars. You see cars go in and cars go out, but a lot happens in between.
Last year, if someone had asked you to track all the work you do on your computer, you would have probably filled out a survey (like the "time tracking" reports I fill out monthly at work). In the future your computer will fill them out for you and in greater detail, and these data will be "mashable" with other reporting systems. This will not happen because two systems are built to work together, but instead because someone build a third system that allows two systems to share information. Eventually we will build "genetic algorithms" that will write programs that can re-organize data into usable structures regardless of how the original data was structured. This is going to happen in the next ten years and we will ask ourselves why we didn't do it sooner.
Question : What would be considered "Big Data"?
1. An OLAP Cube containing customer demographic information about 100, 000, 000 customers
2. Daily Log files from a web server that receives 100, 000 hits per minute
4. Spreadsheets containing monthly sales data for a Global 100 corporation
Ans : 2 Exp :
Question : A data scientist plans to classify the sentiment polarity of , product reviews collected from the Internet. What is the most appropriate model to use? Suppose labeled training data is available. 1. Naive Bayesian classifier 2. Linear regression 3. Access Mostly Uused Products by 50000+ Subscribers 4. K-means clustering Ans : 1 Exp :
Question : Your company has different sales teams. Each team's sales manager has developed incentive offers to increase the size of each sales transaction. Any sales manager whose incentive program can be shown to increase the size of the average sales transaction will receive a bonus. Data are available for the number and average sale amount for transactions offering one of the incentives as well as transactions offering no incentive. The VP of Sales has asked you to determine analytically if any of the incentive programs has resulted in a demonstrable increase in the average sale amount. Which analytical technique would be appropriate in this situation? 1. One-way ANOVA
Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming. Which query interface would you recommend? 1. Pig 2. Hive 3. Access Mostly Uused Products by 50000+ Subscribers 4. HBase
Question : When would you prefer a Naive Bayes model to a logistic regression model for classification?
1. When you are using several categorical input variables with over 1000 possible values each. 2. When you need to estimate the probability of an outcome, not just which class it is in. 3. Access Mostly Uused Products by 50000+ Subscribers 4. When some of the input variables might be correlated. Ans : 1 Exp :
Question : You have fit a decision tree classifier using input variables. The resulting tree used of the variables, and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the model is 0.85. What is your evaluation of this model? 1. The tree is probably overfit. Try fitting shallower trees and using an ensemble method. 2. The AUC is high, and the small nodes are all very pure. This is an accurate model. 3. Access Mostly Uused Products by 50000+ Subscribers accurate model 4. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small nodes will give poor estimates of probability.
Ans : 1 Exp :
Question : If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?
Question : You are testing two new weight-gain formulas for puppies. The test gives the results: Control group: 1% weight gain Formula A. 3% weight gain Formula B. 4% weight gain A one-way ANOVA returns a p-value = 0.027 What can you conclude? 1. Either Formula A or Formula B is effective at promoting weight gain. 2. Formula B is more effective at promoting weight gain than Formula A. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Formula A and Formula B are about equally effective at promoting weight gain.
Ans : 1 Exp :
Question : Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?
Question : The average purchase size from your online sales site is $, . The customer experience team believes a certain adjustment of the website will increase sales. A pilot study on a few hundred customers showed an increase in average purchase size of $1.47, with a significance level of p=0.1. The team runs a larger study, of a few thousand customers. The second study shows an increased average purchase size of $0.74, with a significance level of 0.03. What is your assessment of this study? 1. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size. 2. The change in purchase size is small, but may aggregate up to a large increase in profits over the entire customer base. 3. Access Mostly Uused Products by 50000+ Subscribers should run another, larger study. 4. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement. Ans : 1 Exp :
Question : Which word or phrase completes the statement? Business Intelligence is to monitoring trends as Data Science is to ________ trends. 1. Predicting 2. Discarding 3. Access Mostly Uused Products by 50000+ Subscribers 4. Optimizing
Ans : 1 Exp :
Question : Consider a scale that has five () values that range from "not important" to "very important". Which data classification best describes this data?
Question : You have used k-means clustering to classify behavior of , customers for a retail store. You decide to use household income, age, gender and yearly purchase amount as measures. You have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What should you do? 1. Decrease the number of clusters 2. Increase the number of clusters 3. Access Mostly Uused Products by 50000+ Subscribers 4. Identify additional measures to add to the analysis
Ans : 1 Exp :
Question For which class of problem is MapReduce most suitable?
Question You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?
Question Refer to exhibit. You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables 2. Only three variables-A, B, and C-have significant correlation with sales You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional datA. what is a way that you could try to increase the R2 of the model without artificially inflating it? 1. Create clusters based on the data and use them as model inputs 2. Force all 15 variables into the model as independent variables 3. Access Mostly Uused Products by 50000+ Subscribers 4. Break variables A, B, and C into their own univariate models
Ans : 1 Exp :
Question You are given , , user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed? 1. Run MapReduce to transform the data, and find relevant key value pairs. 2. Divide the data into sets of 1, 000 user profiles, and run K-means clustering in RHadoop iteratively. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Partition the data by XML file size, and run K-means clustering in each partition.
Ans : 1 Exp :
Question : A call center for a large electronics company handles an average of , support calls a day. The head of the call center would like to optimize the staffing of the call center during the rollout of a new product due to recent customer complaints of long wait times. You have been asked to create a model to optimize call center costs and customer wait times. The goals for this project include: 1. Relative to the release of a product, how does the call volume change over time? 2. How to best optimize staffing based on the call volume for the newly released product, relative to old products. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Determine the frequency of calls by both product type and customer language. Which goals are suitable to be completed with MapReduce? 1. 2,4 2. 1,3 3. Access Mostly Uused Products by 50000+ Subscribers 4. 2,3,4
Ans : 1 Exp :
Question : You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study? 1. K-means clustering 2. Linear regression 3. Access Mostly Uused Products by 50000+ Subscribers 4. Decision trees
Ans : 1 Exp :
Question A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend?
1. The manufacturing process should be inspected for problems. 2. A larger sample size should be taken to determine if the plant is functioning properly 3. Access Mostly Uused Products by 50000+ Subscribers 4. The manufacturing process is functioning properly and no further action is required. Ans : 1 Exp :
Question : A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project? 1. Linear regression 2. K-means clustering 3. Access Mostly Uused Products by 50000+ Subscribers 4. Logistic regression
Ans : 4 Exp :
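As a rough illustration of why logistic regression fits here: the model passes a linear combination of the risk factors through the logistic function, so the output always lies between 0 and 1 and can be read as a probability. The coefficients below are made-up placeholders, not fitted values:

```python
import math


def predicted_probability(age, is_male, cholesterol, intercept, coefs):
    """Logistic model: p = 1 / (1 + exp(-z)), with z linear in the risk factors."""
    z = intercept + coefs[0] * age + coefs[1] * is_male + coefs[2] * cholesterol
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only to make the example run.
ILLUSTRATIVE_INTERCEPT = -7.0
ILLUSTRATIVE_COEFS = (0.05, 0.4, 0.01)

p = predicted_probability(60, 1, 240, ILLUSTRATIVE_INTERCEPT, ILLUSTRATIVE_COEFS)
```

A linear regression on the same inputs could return values below 0 or above 1, which is why it is a poor fit for predicting a probability.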
Question : What are the characteristics of Big Data?
1. Data volume, business importance, and data structure variety. 2. Data volume, processing complexity, and data structure variety. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Data volume, processing complexity, and business importance. Ans : 2 Exp :
Question : You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical method would you recommend?
Question : What is the format of the output from the Map function of MapReduce? 1. Key-value pairs 2. Binary representation of keys concatenated with structured data 3. Access Mostly Uused Products by 50000+ Subscribers 4. Unique key record and separate records of all possible values Ans : 1
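The key-value output of Map can be sketched with the classic word count, written here in plain Python rather than against the Hadoop API. The function names are hypothetical; the reduce step is included only to show what consumes the pairs:

```python
from collections import defaultdict


def map_words(line):
    """Map: emit one (word, 1) key-value pair per word in the input line."""
    for word in line.split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce: sum the values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = list(map_words("a b a"))
counts = reduce_counts(pairs)
```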
Question : Which data type value is used for the observed response variable in a logistic regression model? 1. Any positive real number 2. Any integer 3. Access Mostly Uused Products by 50000+ Subscribers 4. Any real number Ans : 3 Exp :
Question : In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables? 1. Apply a transformation to a variable 2. Use a different statistical package 3. Access Mostly Uused Products by 50000+ Subscribers 4. Change the units of measurement on the independent variable
Ans : 1 Exp :
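A small sketch of answer 1, using synthetic data generated for the example: when y grows exponentially with x, taking the logarithm of y makes the relationship linear, which shows up as a Pearson correlation of essentially 1. The helper `pearson` is written out by hand so the block has no dependencies:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data: y = 2 * exp(0.5 * x), an exponential relationship.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

r_raw = pearson(xs, ys)                          # curved relationship
r_log = pearson(xs, [math.log(y) for y in ys])   # linear after the transform
```

Changing the units of x or y, by contrast, only rescales the data and leaves the correlation unchanged.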
Question : What is the primary bottleneck in text classification? 1. The ability to parse unstructured text data. 2. The high dimensionality of text data. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The fact that text corpora are dynamic.
Ans : 3 Exp :
Question : Your customer provided you with , unlabeled records and asked you to separate them into three groups. What is the correct analytical method to use?
Question : How does Pig's use of a schema differ from that of a traditional RDBMS? 1. Pig's schema is optional 2. Pig's schema requires that the data is physically present when the schema is defined 3. Access Mostly Uused Products by 50000+ Subscribers 4. Pig's schema supports a single data type
Ans : 1 Exp :
Question : You are asked to create a model to predict the total number of monthly subscribers for a specific magazine. You are provided with 1 year's worth of subscription and payment data, user demographic data, and 10 years' worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building a predictive model for subscribers? 1. Logistic regression 2. Decision trees 3. Access Mostly Uused Products by 50000+ Subscribers 4. Linear regression
Ans : 4 Exp :
Question : What describes a true property of the Logistic Regression method? 1. It is robust with redundant variables and correlated variables. 2. It handles missing values well. 3. Access Mostly Uused Products by 50000+ Subscribers 4. It works well with variables that affect the outcome in a discontinuous way.
Ans : 1 Exp :
Question : A data scientist is asked to implement an article recommendation feature for an on-line magazine. The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics. Which method should the data scientist try first? 1. Association Rules 2. Naive Bayesian 3. Access Mostly Uused Products by 50000+ Subscribers 4. K Means Clustering
Ans : 4 Exp :
Question : While having a discussion with your colleague, this person mentions that they want to perform K-means clustering on text file data stored in HDFS. Which tool would you recommend to this colleague? 1. Sqoop 2. HBase 3. Access Mostly Uused Products by 50000+ Subscribers 4. Scribe
Ans : 3 Exp :
Question : What describes a true limitation of the Logistic Regression method? 1. It does not handle redundant variables well. 2. It does not handle missing values well. 3. Access Mostly Uused Products by 50000+ Subscribers 4. It does not have explanatory values.
Ans : 2 Exp :
Question : You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing. What should you do? 1. Ensure that the JobTracker is running 2. Ensure that the TaskTracker is running. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Ensure that a DataNode is running
Ans : 2 Exp :
Question : A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend? 1. A smaller sample size should be taken to determine if the plant is operating correctly 2. A larger sample size should be taken to determine if the plant is operating correctly 3. Access Mostly Uused Products by 50000+ Subscribers 4. There is a flaw in the quality assurance process and the sample should be repeated
Ans : 3 Exp :
Question : Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use?
Question : What describes the use of the UNION clause in a SQL statement?
1. Operates on queries and potentially increases the number of rows 2. Operates on queries and potentially decreases the number of rows 3. Access Mostly Uused Products by 50000+ Subscribers 4. Operates on both tables and queries and potentially increases both the number of rows and columns
Ans : 1 Exp :
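To see answer 1 in practice, here is a small sketch using Python's built-in `sqlite3` module with two hypothetical tables, `east` and `west`. UNION combines the results of two queries, so the row count can grow while the column count stays the same; duplicate rows are removed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE east (customer TEXT)")
conn.execute("CREATE TABLE west (customer TEXT)")
conn.executemany("INSERT INTO east VALUES (?)", [("ann",), ("bob",)])
conn.executemany("INSERT INTO west VALUES (?)", [("bob",), ("cy",)])

# UNION operates on the two SELECT queries, not on the tables directly.
rows = conn.execute(
    "SELECT customer FROM east UNION SELECT customer FROM west ORDER BY customer"
).fetchall()
conn.close()
```

Each source query returns two rows; the union returns three, because "bob" appears in both and is kept once. UNION ALL would keep the duplicate and return four.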
Question : In the MapReduce framework, what is the purpose of the Reduce function? 1. It distributes the input to multiple nodes for processing 2. It writes the output of the Map function to storage 3. Access Mostly Uused Products by 50000+ Subscribers 4. It breaks the input into smaller components and distributes to other nodes in the cluster
Question : Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously worked extensively with SQL and databases. Which query interface would you recommend? 1. HBase 2. Crunch 3. Access Mostly Uused Products by 50000+ Subscribers 4. Hive