Explanation: Apache Pig consists of a data flow language, Pig Latin, and an environment to execute the Pig code. The main benefit of using Pig is to utilize the power of MapReduce in a distributed system while simplifying the tasks of developing and executing MapReduce jobs. In most cases the user does not even need to be aware that a MapReduce job is running in the background when Pig commands are executed. This abstraction layer on top of Hadoop simplifies the development of code against data in HDFS and makes MapReduce accessible to a larger audience. With Apache Hadoop and Pig already installed, the basics of using Pig consist of entering the Pig execution environment by typing pig at the command prompt and then entering a sequence of Pig instruction lines at the grunt prompt.
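As a small, hypothetical illustration of such a grunt session (the HDFS path, delimiter, and field names below are invented for the example and are not part of the question), a few Pig Latin statements might look like this:

$ pig
grunt> -- load tab-delimited records from HDFS into a relation
grunt> logs = LOAD '/data/weblogs' USING PigStorage('\t') AS (user:chararray, bytes:long);
grunt> by_user = GROUP logs BY user;
grunt> totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
grunt> -- DUMP triggers the underlying MapReduce job(s) and prints the result
grunt> DUMP totals;

Each statement defines a relation; Pig only compiles and runs the corresponding MapReduce jobs when an output operation such as DUMP or STORE is reached.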
Question: A call center for a large electronics company handles an average of , support calls a day. The head of the call center would like to optimize the staffing of the call center during the rollout of a new product, due to recent customer complaints about long wait times. You have been asked to create a model to optimize call center costs and customer wait times. The goals for this project include:
1. Relative to the release of a product, how does the call volume change over time?
2. How to best optimize staffing based on the call volume for the newly released product, relative to old products.
3.
4. Determine the frequency of calls by both product type and customer language.
Which goals are suitable to be completed with MapReduce?
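Goal 4, counting calls by product type and customer language, is a classic aggregation that maps naturally onto MapReduce. Purely as an illustrative sketch (the input path, delimiter, and field names are assumptions, not given in the question), the count could be expressed in Pig Latin, which Pig translates into MapReduce jobs:

grunt> calls = LOAD '/data/call_logs' USING PigStorage(',') AS (call_id:chararray, product:chararray, language:chararray, wait_secs:int);
grunt> -- group on the (product, language) pair and count the calls in each group
grunt> by_key = GROUP calls BY (product, language);
grunt> freq = FOREACH by_key GENERATE FLATTEN(group) AS (product, language), COUNT(calls) AS num_calls;
grunt> DUMP freq;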
Question: Consider the example of an analysis for fraud detection on credit card usage. You need to ensure that higher-risk transactions, which may indicate fraudulent credit card activity, are retained in your data for analysis and not dropped as outliers during pre-processing. What should your approach be for loading data into the analytic sandbox for this analysis?
Explanation: Phase 2, Data Preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) processes to get data into the sandbox; ELT and ETL are sometimes referred to together as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
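For the fraud-detection case above, an ELT-style load is one way to keep the high-risk transactions intact: all raw records are copied into the sandbox first, and transformations (including any outlier handling) are applied afterwards, so pre-processing never discards the very transactions the analysis needs. As a rough sketch only, with invented paths and field names, such a load in Pig Latin might look like this:

grunt> -- ELT-style load: copy the raw transactions into the sandbox untouched,
grunt> -- so no outlier filtering happens before the team can analyze them
grunt> raw_txns = LOAD '/warehouse/card_transactions' USING PigStorage(',') AS (txn_id:chararray, account_id:chararray, amount:double, merchant:chararray, ts:chararray);
grunt> STORE raw_txns INTO '/sandbox/fraud_project/transactions_raw' USING PigStorage(',');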
1. Computationally inexpensive, easy to implement, knowledge representation easy to interpret
2. May have low accuracy
3.
4. Only 1 and 3 are correct
5. All of 1, 2, and 3 are correct