Question : You are working in the HadoopExam consultancy team and have written a MapReduce job and a Pig job. Which of the following statements is correct?
1. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
2. Pig comes with no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
Pig is a framework that translates programs written in Pig Latin into jobs that are executed by the MapReduce framework. Pig does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform. For an overview of Pig, see chapter 11 of Hadoop: The Definitive Guide, 3rd Edition, or the Wikipedia page on Pig.
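To make the relationship concrete, here is a minimal, hypothetical sketch (not part of the original question) of a Pig program driven from Java through the Pig interpreter: PigServer runs in MAPREDUCE mode, so the Pig Latin statements are compiled into MapReduce jobs and submitted to the cluster. The input path, field names, and output path are invented for illustration.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigAsMapReduce {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode: the Pig interpreter compiles the script below into
        // one or more MapReduce jobs and submits them to the Hadoop cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery call adds a Pig Latin statement to the logical plan.
        pig.registerQuery("subs = LOAD '/data/subscribers.txt' AS (name:chararray, plan:chararray);");
        pig.registerQuery("grouped = GROUP subs BY plan;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(subs);");

        // store() triggers compilation and execution of the plan as MapReduce jobs.
        pig.store("counts", "/output/subscriber_counts");
    }
}
```

The same script could be run unchanged from the Grunt shell or a .pig file; either way the work is still expressed as ordinary MapReduce jobs, which is why Pig adds convenience rather than new processing capability.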
Watch the training Module 11 from http://hadoopexam.com/index.html/#hadoop-training
Question : Every day HadoopExam gains a good number of subscribers, but the file created from this information is smaller than 64MB, and 64MB is also the block size configured on the cluster. You are running a job that will process this file as a single input split on a cluster which has no other jobs currently running, and with all settings at their default values. Each node has an equal number of open Map slots. On which node will Hadoop first attempt to run the Map task?
1. The node containing the first TaskTracker to heartbeat into the JobTracker, regardless of the location of the input split
2. The node containing the first JobTracker to heartbeat into the Namenode, regardless of the location of the input split
3. Access Mostly Uused Products by 50000+ Subscribers
4. The node containing the nearest location of the input split
Explanation: Picking the appropriate size for the tasks of your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed; at the other extreme is 1,000,000 maps / 1,000,000 reduces, where the framework runs out of resources for the overhead.
Number of Maps: The number of maps is usually driven by the number of DFS blocks in the input files, although that leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been taken up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute. Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments; however, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with roughly 82,000 maps, unless mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps. The number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num); this can raise the number of map tasks, but cannot set it below the number Hadoop determines by splitting the input data.
Number of Reduces: The ideal number of reducers is the value that gets them closest to:
* a multiple of the block size,
* a task time between 5 and 15 minutes, and
* the fewest output files possible.
Anything other than that means there is a good chance your reducers are less than great. Users have a strong tendency to pick either a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
* terrible performance in the next phase of the workflow,
* terrible performance due to the shuffle,
* terrible overall performance because the namenode is overloaded with objects that are ultimately useless,
* destroyed disk IO for no really sane reason, and
* lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work.
There are always exceptions and special cases; in particular, if following this advice makes the next step in the workflow do ridiculous things, then that step likely needs to be treated as an exception to the general rules of thumb above. Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces must be less than heapSize). This will be fixed at some point, but until it is, it provides a pretty firm upper bound. The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
When the number of reduce tasks is set to zero, no reduce tasks are executed for that job. The intermediate data produced by the map phase is copied into HDFS as the output without modification; the intermediate data from each mapper becomes a single output file in HDFS.
When the number of reduce tasks is set to one, a single reduce task is executed for that job. That reducer processes all intermediate data produced by the map phase and produces a single output file in HDFS.
Every TaskTracker periodically sends a heartbeat message to the JobTracker. Whenever a TaskTracker sends a heartbeat and there are tasks waiting to be assigned, the JobTracker assigns that TaskTracker a number of tasks equal to the number of the TaskTracker's free slots, up to the number of pending tasks. If the TaskTracker has fewer free slots than there are pending tasks, the JobTracker assigns it the tasks that will have the minimum number of network hops to their data when executed by that TaskTracker. Ideally the TaskTracker is assigned tasks whose data is local to the TaskTracker node or, as the second choice, local to another node in the same rack as the TaskTracker node ("rack local"). When, however, the TaskTracker has more free slots than there are pending tasks, it is assigned all pending tasks, regardless of where the data for those tasks is located.
If a job is submitted that will process a single input split, that job will contain a single map task. Since that single map task is the only pending task, the next TaskTracker to send a heartbeat message will be assigned the task, regardless of data locality. For an explanation of the task execution process, see Anatomy of a MapReduce Job Run with Hadoop. For an explanation of how input splits are computed and how they relate to map tasks, see the Scaling Out: Data Flow section of chapter 2 in Hadoop: The Definitive Guide, 3rd Edition.
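As a rough, hedged illustration of the JobConf settings discussed above (the paths and class name are invented), the sketch below uses the classic mapred API to hint the number of map tasks and to set the number of reduce tasks to zero. With a single input split there is exactly one map task, and with zero reducers the map output is written straight to HDFS.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleSplitMapOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SingleSplitMapOnlyJob.class);
        conf.setJobName("daily-subscriber-file");

        // The input file is smaller than the 64MB block size, so the default
        // InputFormat produces a single split and therefore a single map task.
        FileInputFormat.setInputPaths(conf, new Path("/data/subscribers/today.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("/output/subscribers/today"));

        // A hint only: the InputFormat still decides the real number of maps
        // from the splits, and it will not go below that number.
        conf.setNumMapTasks(1);

        // Zero reducers makes this a map-only job: no shuffle or sort, and the
        // map output is written to HDFS unmodified, one file per mapper.
        conf.setNumReduceTasks(0);

        JobClient.runJob(conf);
    }
}
```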
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
Question : You are working on a project for a HadoopExam client where you need to chain together MapReduce and Pig jobs. You also need the ability to use forks, decision points, and path joins. Which of the following ecosystem projects allows you to accomplish this?
Explanation: An Oozie workflow is a collection of actions (e.g., Hadoop MapReduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph) that specifies the sequence in which the actions execute. The graph is written in hPDL (an XML Process Definition Language). hPDL is a fairly compact language, using a limited set of flow-control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) as well as mechanisms to control the workflow execution path (decision, fork and join nodes). Oozie itself is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store workflow definitions and currently running workflow instances, including instance states and variables.
HUE is a GUI for interacting with a Hadoop cluster. Sqoop is a tool for transferring data between HDFS and other external data stores. HBase is a distributed key-value store. Zookeeper is a distributed coordination engine. Oozie is a workflow and orchestration framework.
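The fork, decision and join nodes themselves live in the workflow's hPDL workflow.xml, which is deployed to HDFS; the hypothetical Java sketch below only shows how such a workflow might be submitted through Oozie's client API. The server URL, HDFS paths, and the nameNode/jobTracker properties are assumptions that depend on the actual workflow definition.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is hypothetical).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // The application path points at a directory containing workflow.xml,
        // the hPDL definition where the MapReduce and Pig actions, forks,
        // decisions, and joins are declared.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoopexam/workflows/chain");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");

        // Submit and start the workflow; Oozie walks the DAG and runs each action.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job: " + jobId);

        // Check the workflow's current status.
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```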
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
1. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
2. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
3. Access Mostly Uused Products by 50000+ Subscribers optimal performance for map-only or reduce-only jobs.
4. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.