Question : ___________ is an optimization technique where a computer system performs some task that may not actually be needed. The main idea is to do work before it is known whether that work will be needed at all, so as to prevent the delay that would be incurred by doing the work only after it is known to be needed. If it turns out the work was not needed after all, the results are ignored. The Hadoop framework also provides a mechanism to handle machine issues such as faulty configuration or hardware failure: the JobTracker detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. This behaviour is known as ________________
Correct Answer : Speculative Execution
Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may be reading its input at only 10% of the speed of the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the others.
Because tasks run in isolation from one another, individual tasks do not know where their inputs come from; they trust the Hadoop platform to deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively. The MapReduce framework monitors the progress of tasks; when one or more tasks are not making sufficiently fast progress, additional copies of those tasks will be speculatively executed. For additional information on speculative execution, see chapter 4 of Hadoop: The Definitive Guide, 3rd Edition, in the Task Execution: Speculative Execution section.
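As a concrete illustration of the JobConf options named above, here is a minimal sketch using the old mapred API that turns speculative execution off for both phases; the class and method names are made up for the example:

import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExecutionConfig {
    public static JobConf disableSpeculation(JobConf conf) {
        // Disable redundant (speculative) copies of map tasks.
        conf.setMapSpeculativeExecution(false);      // mapred.map.tasks.speculative.execution
        // Disable redundant (speculative) copies of reduce tasks.
        conf.setReduceSpeculativeExecution(false);   // mapred.reduce.tasks.speculative.execution
        // Equivalent to setting the properties directly:
        // conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        // conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}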
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
Question : You are working in the HadoopExam consultancy team and have written a MapReduce job and a Pig job. Which of the following is the correct statement?
1. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
2. Pig comes with no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
Pig is a framework that translates programs written in Pig Latin into jobs that are executed by the MapReduce framework. Pig does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform. For an overview of Pig, see chapter 11 of Hadoop: The Definitive Guide, 3rd Edition, or the Wikipedia page on Pig.
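To show how a Pig Latin program is handed to the Pig interpreter and compiled into MapReduce jobs, here is a minimal sketch that embeds Pig via the PigServer API; the word-count logic and HDFS paths are assumptions made for the example, not part of the original question:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCountSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements on the cluster; Pig compiles them into MapReduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("lines = LOAD '/user/hadoopexam/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Storing the relation triggers execution of the compiled MapReduce job(s).
        pig.store("counts", "/user/hadoopexam/wordcount-output");
    }
}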
Watch the training Module 11 from http://hadoopexam.com/index.html/#hadoop-training
Question : Every day HadoopExam gains a good number of subscribers, but the file created from this information is smaller than 64MB, and 64MB is configured as the block size on the cluster. You are running a job that will process this file as a single input split on a cluster which has no other jobs currently running, and with all settings at their default values. Each node has an equal number of open Map slots. On which node will Hadoop first attempt to run the Map task?
1. The node containing the first TaskTracker to heartbeat into the JobTracker, regardless of the location of the input split
2. The node containing the first JobTracker to heartbeat into the Namenode, regardless of the location of the input split
3. Access Mostly Uused Products by 50000+ Subscribers
4. The node containing the nearest location of the input split
Correct Answer : 1 (the node containing the first TaskTracker to heartbeat into the JobTracker, regardless of the location of the input split)
Explanation: Picking the appropriate size for the tasks in your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but also improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed. The other extreme is to have 1,000,000 maps / 1,000,000 reduces, where the framework runs out of resources for the overhead.
Number of Maps: The number of maps is usually driven by the number of DFS blocks in the input files, which leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been taken up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute. Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with about 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps. The number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines by splitting the input data.
Number of Reduces: The ideal number of reducers should be the optimal value that gets them closest to:
* a multiple of the block size
* a task time between 5 and 15 minutes
* the fewest output files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a really high value ("More parallelism means faster!") or a really low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
* terrible performance on the next phase of the workflow
* terrible performance due to the shuffle
* terrible overall performance because you've overloaded the NameNode with objects that are ultimately useless
* destroying disk I/O for no really sane reason
* lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work
There are always exceptions and special cases. One particular special case is that if following this advice makes the next step in the workflow do ridiculous things, then it is likely better to treat that job as an exception to the above general rules of thumb. Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces must be less than heapSize). This will be fixed at some point, but until then it provides a fairly firm upper bound. The number of reduce tasks can also be set in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num). When the number of reduce tasks is set to zero, no reduce tasks are executed for that job; the intermediate data produced by the map phase is copied into HDFS as the output without modification, and the intermediate data from each mapper becomes a single output file in HDFS.
When the number of reduce tasks is set to one, a single reduce task is executed for that job; that reducer processes all intermediate data produced by the map phase and produces a single output file in HDFS. Every TaskTracker periodically sends a heartbeat message to the JobTracker. Whenever a TaskTracker sends the heartbeat message and there are tasks waiting to be assigned, the JobTracker will assign that TaskTracker a number of tasks equal to the number of the TaskTracker's free slots, up to the number of pending tasks. If the TaskTracker has fewer free slots than the number of pending tasks, the JobTracker will assign the TaskTracker the tasks that will have the minimum number of network hops to their data when executed by that TaskTracker. Ideally the TaskTracker will be assigned tasks whose data is local to the TaskTracker node or, as the second choice, local to another node in the same rack as the TaskTracker node ("rack local"). When, however, the TaskTracker has more free slots than the number of pending tasks, the TaskTracker will be assigned all pending tasks, regardless of where the data for those tasks is located. If a job is submitted that will process a single input split, that job will contain a single map task. As that single map task is the only pending task, the next TaskTracker that sends a heartbeat message will be assigned the task, regardless of data locality. For an explanation of the task execution process, see Anatomy of a MapReduce Job Run with Hadoop. For an explanation of how input splits are computed and how they relate to map tasks, see chapter 2 in Hadoop: The Definitive Guide, 3rd Edition, in the Scaling Out: Data Flow section.
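As a hedged sketch of the JobConf knobs described above for controlling map and reduce task counts (old mapred API; the paths and numbers are illustrative only, not recommendations):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
    public static JobConf configure() {
        JobConf conf = new JobConf(TaskCountSketch.class);
        // Hint to the InputFormat; the real map count is still driven by the input splits.
        conf.setNumMapTasks(100);
        // Raise the lower bound on split size (bytes) to get fewer, larger maps.
        conf.setLong("mapred.min.split.size", 128 * 1024 * 1024L);
        // The reduce count is honored exactly; 0 skips the reduce phase entirely.
        conf.setNumReduceTasks(10);
        FileInputFormat.setInputPaths(conf, new Path("/user/hadoopexam/input"));   // illustrative paths
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoopexam/output"));
        return conf;
    }
}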
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
1. Combiner does the local aggregation of data, thereby allowing the number of mappers to process input data faster.
2. Combiner does the local aggregation of data, thereby reducing the number of mappers that need to run.
3. Access Mostly Uused Products by 50000+ Subscribers
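The options above concern the combiner. As a hedged illustration of what local aggregation by a combiner looks like in practice, here is a minimal word-count sketch using the old mapred API; the class names are made up for the example:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountWithCombiner {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.collect(new Text(w), ONE);
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void configure(JobConf conf) {
        conf.setMapperClass(Map.class);
        // The combiner runs on the map side, pre-summing counts so that less
        // intermediate data is shuffled to the reducers; it does NOT change
        // how many mappers run.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
    }
}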
1. Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file
2. Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop compatible format prior to loading
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hadoop can freely use binary files with map-reduce jobs so long as the files have headers
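To illustrate the approach described in option 1, here is a hedged sketch of packing raw binary payloads into a SequenceFile; the HDFS path, key scheme, and byte content are assumptions made for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BinaryToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/hadoopexam/binary.seq");   // illustrative path
        SequenceFile.Writer writer = null;
        try {
            // Key = record name, value = raw bytes; the resulting SequenceFile can be
            // consumed directly by a MapReduce job via SequenceFileInputFormat.
            writer = SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            byte[] payload = new byte[] {0x10, 0x20, 0x30};   // stands in for real binary data
            writer.append(new Text("record-1"), new BytesWritable(payload));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}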