Question : You've written a MapReduce job based on the HadoopExam website's log file named MAIN.PROFILE.log, resulting in an extremely large amount of output data. Which of the following cluster resources will your job stress?
1. network I/O and disk I/O
2. network I/O and RAM
3. Access Mostly Uused Products by 50000+ Subscribers
4. RAM, network I/O and disk I/O
Correct Answer : 1 (network I/O and disk I/O)

Explanation: MapReduce inputs typically come from input files loaded onto the processing cluster in HDFS. These files are evenly distributed across all nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in the cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them, so any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.

When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side effects, restarts can be handled more gracefully.

When denormalizing a table, the amount of data written by the map phase will far exceed the amount of data read by the map phase. All of the data written during the map phase is first written to local disk and then transferred over the network to the reducers at the beginning of the reduce phase. Writing a very large amount of data in the map phase will therefore create a large amount of local disk I/O on the machines running map tasks, as well as network I/O. Because map output is stored in a fixed-size buffer that is written periodically to disk, this operation will not tax the memory of the machines running the map tasks. Denormalizing a table is not a compute-intensive operation, so it will not tax the processors of the machines running the map tasks either.

Partition & Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper produced them. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data. The Partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result.

Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

Reduce: A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives OutputCollector and Reporter objects as parameters; they are used in the same manner as in the map() method.
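To make the partitioning step above concrete, here is a minimal sketch of the default behaviour described in the explanation: hash the key and map it onto one of the reduce partitions. It uses the old org.apache.hadoop.mapred API (matching the OutputCollector/Reporter style used in these answers); the class name HashKeyPartitioner is a hypothetical example, not something referenced by the question.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Minimal sketch: route each intermediate (key, value) pair to a reduce
// partition by hashing the key, mirroring what the default partitioner does.
public class HashKeyPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }

    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Because the partition is derived from the key alone, every value sharing a key lands in the same partition and is therefore handled by the same reduce task, which is exactly the guarantee the explanation relies on.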
Question : You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Flag"), new Text("Rahul"));
output.collect(new Text("Shirt"), new Text("Yakul"));
output.collect(new Text("Shoe"), new Text("Rahul"));
output.collect(new Text("Flag"), new Text("Gemini"));
output.collect(new Text("Socks"), new Text("Yakul"));
How many times will the Reducer's reduce() method be invoked?
Explanation: Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number; it would look like:
AAA-123 65mph, 12:00pm
ZZZ-789 50mph, 12:02pm
AAA-123 40mph, 12:05pm
CCC-456 25mph, 12:15pm
...

The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.

MapReduce is also less strict than other languages about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs.

After the shuffle and sort phase there will be four calls to the reduce() method: one passed "Flag" as the key and "Rahul, Gemini" as the values, one passed "Shirt" as the key and "Yakul" as the value, one passed "Shoe" as the key and "Rahul" as the value, and one passed "Socks" as the key and "Yakul" as the value.
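As an illustration of those four reduce() invocations, the following is a minimal, hypothetical reducer (old mapred API) that concatenates the values for each key. Fed the five pairs emitted by the Mapper above, its reduce() method runs once each for "Flag", "Shirt", "Shoe" and "Socks", with "Flag" receiving both "Rahul" and "Gemini". The class name ConcatValuesReducer is an assumption for illustration only.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a reducer: called once per distinct key, so for the five
// collect() calls above it runs four times ("Flag", "Shirt", "Shoe", "Socks"),
// with "Flag" receiving the two values "Rahul" and "Gemini".
public class ConcatValuesReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
            if (joined.length() > 0) {
                joined.append(", ");
            }
            joined.append(values.next().toString());
        }
        output.collect(key, new Text(joined.toString()));
    }
}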
Question : ___________ is an optimization technique where a computer system performs some task that may not be actually needed. The main idea is to do work before it is known whether that work will be needed at all, so as to prevent a delay that would have to be incurred by doing the work after it is known whether it is needed. If it turns out the work was not needed after all, the results are ignored. The Hadoop framework also provides a mechanism to handle machine issues such as faulty configuration or hardware failure. The JobTracker detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. This behaviour is known as ________________
Correct Answer : Speculative execution

Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may be reading its input at only 10% of the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the others.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to simply deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively. The MapReduce framework monitors the progress of tasks; when one or more tasks are not making sufficiently fast progress, additional copies of those tasks will be speculatively executed. For additional information on speculative execution, see chapter 4 of Hadoop: The Definitive Guide, 3rd Edition, in the Task Execution: Speculative Execution section.
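As a minimal sketch of the configuration options mentioned above (assuming the old mapred API; the driver class name NoSpeculationDriver is hypothetical), speculative execution can be switched off per job like this:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculationDriver {
    public static void main(String[] args) {
        // Disable speculative execution for map and reduce tasks using the
        // property names mentioned in the explanation above.
        JobConf conf = new JobConf(NoSpeculationDriver.class);
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // JobConf also exposes equivalent convenience setters:
        // conf.setMapSpeculativeExecution(false);
        // conf.setReduceSpeculativeExecution(false);
    }
}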
1. Four, all files will be processed
2. Three, the pound sign is an invalid character for HDFS file names
3. Access Mostly Uused Products by 50000+ Subscribers
4. None, the directory cannot be named jobdata
5. One, no special characters can prefix the name of an input file
1. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job. What else is the requirement of the class using this libjars?
2. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Package your code and the Apache Commons Math library into a zip file named JobJar.zip
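For context on option 1's mention of libjars: the -libjars generic option is only honoured when the driver runs through ToolRunner/GenericOptionsParser, so the driver class is normally expected to implement the Tool interface. The sketch below is a hypothetical driver illustrating that pattern; the class name, JAR names and paths are placeholders, not part of the original question.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: implementing Tool lets GenericOptionsParser handle
// generic options such as -libjars, which ships extra JARs with the job.
public class MyJobDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyJobDriver.class);
        // ... set mapper, reducer, input and output paths here before running ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}

// Example invocation (JAR names and paths are placeholders):
//   hadoop jar myjob.jar MyJobDriver -libjars commons-math.jar input/ output/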