Question : A disk drive manufacturer has a defect rate of less than .% with % confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend?
1. A larger sample size should be taken to determine if the plant is operating correctly 2. The manufacturing process is functioning properly and no further action is required 3. A smaller sample size should be taken to determine if the plant is operating correctly 4. There is a flaw in the quality assurance process and the sample should be repeated
Correct Answer : 2 Explanation: Defect rate is less than 1.5% (which is equal to 15 out of 1000 and confidence level is also high 98%) So need to take action.
Question : What is a core deliverable at the end of the analytic project? 1. An implemented database design 2. A whitepaper describing the project and the implementation 3. A presentation for project sponsors 4. The training materials
Correct Answer : 3 Explanation: Four main deliverables: Presentation for Project Sponsors contains high-level takeaways for executivelevel stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.
Presentation for Analysts, which describes changes to business processes and reports. Data scientists reading this presentation are comfortable with technical graphs (such as Receiver Operating Characteristic [ROC] curves, density plots, and histograms) and will be interested in the details.
Code for technical people, such as engineers and others managing the production environment
Technical specifications for implementing the code
Question : You have been assigned to run a logistic regression model for each of countries, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort? 1. RStudio 2. MADlib 3. RStudio 4. HBase
Correct Answer : 2 Explanation: MADlib is an open-source library for scalable in-database analytics. It offers dataparallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. Because MADlib is designed and built to accommodate massive parallel processing of data, MADlib is ideal for Big Data in-database analytics. MADlib supports the opensource database PostgreSQL as well as the Pivotal Greenplum Database and Pivotal HAWQ. HAWQ is a SQL query engine for data stored in the Hadoop Distributed File System (HDFS). Module Description Generalized Linear Models : Includes linear regression, logistic regression, and multinomial logistic regression
Question : What is the purpose of the process step "parsing" in text analysis? 1. computes the TF-IDF values for all keywords and indices 2. executes the clustering and classification to organize the contents 3. performs the search and/or retrieval in finding a specific topic or an entity in a document 4. imposes a structure on the unstructured/semi-structured text for downstream analysis