Question : Our website www.HadoopExam.com has millions of profiles, and ETL jobs have been created to process them. You have submitted an ETL MapReduce job to Hadoop that analyzes the HadoopExam.com website's log files and combines them with the profile data, and you notice in the JobTracker's Web UI that the Mappers are 80% complete while the Reducers are 20% complete. What is the best explanation for this? 1. The progress attributed to the reducer refers to the transfer of data from completed Mappers. 2. The progress attributed to the reducer refers to the transfer of data from Mappers that are still running. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The progress attributed to the reducer refers to the transfer of data from Mappers and cannot be predicted.
Explanation: While the reduce() method is not called until all of the mappers have completed, the transfer of data from completed mappers starts before all of the mappers have finished. The mapred.reduce.slowstart.completed.maps property specifies the percentage of mappers that must complete before the reducers can start receiving data from the completed mappers.

A Reducer has three primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper across the network using HTTP.

Sort: The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

SecondarySort: To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce(). The grouping comparator is specified via Job.setGroupingComparatorClass(Class), and the sort order is controlled by Job.setSortComparatorClass(Class). For example, say that you want to find duplicate web pages and tag them all with the URL of the "best" known example. You would set up the job like this: Map Input Key: url; Map Input Value: document; Map Output Key: document checksum, url pagerank; Map Output Value: url; Partitioner: by checksum; OutputKeyComparator: by checksum and then decreasing pagerank; OutputValueGroupingComparator: by checksum.

Reduce: No reduce task's reduce() method is called until all map tasks have completed. Every reduce task's reduce() method expects to receive its data in sorted order. In this phase the reduce(Object, Iterable, Context) method is called for each (key, (collection of values)) pair in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted. If the reduce() method were called before all of the map tasks had completed, it could receive data out of order. For more information about the shuffle and sort phase, see the training referenced below.
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
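To make the slow-start behaviour concrete, here is a minimal driver sketch. Only the property name mapred.reduce.slowstart.completed.maps comes from the explanation above; the class name, job name, and the 0.80 threshold are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LogAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Reducers begin fetching map output once 80% of the mappers have
        // finished; the default is much lower (typically 0.05), which is why
        // reducer "progress" appears while mappers are still running.
        conf.set("mapred.reduce.slowstart.completed.maps", "0.80");

        Job job = Job.getInstance(conf, "HadoopExam log analysis");
        job.setJarByClass(LogAnalysisDriver.class);
        // For the secondary-sort example above, this is also where
        // Job.setSortComparatorClass and Job.setGroupingComparatorClass
        // would be called, along with the mapper, reducer, and I/O paths.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}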
Question : In your MapReduce job, you have three configuration parameters. What is the correct or best way to pass these three configuration parameters to a mapper or reducer? 1. As key pairs in the Configuration object. 2. As value pairs in the Configuration object. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Not possible
Explanation: Unless I'm missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the content of the Properties object to the Hadoop Configuration object. For example, something like this:
Configuration conf = new Configuration();
Properties params = getParameters(); // do whatever you need here to create your object
for (Map.Entry<Object, Object> entry : params.entrySet()) {
    String propName = (String) entry.getKey();
    String propValue = (String) entry.getValue();
    conf.set(propName, propValue);
}
Then inside your M/R job, you can use the Context object to get back your Configuration in both the mapper (the map function) and the reducer (the reduce function), like this:
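For instance, in a mapper it could look like the following sketch. The property name my.custom.property and the class name are made up for illustration; they stand in for whichever keys the driver loop above actually set.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParameterAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String threshold;

    @Override
    protected void setup(Context context) {
        // Pull the value back out of the job Configuration.
        threshold = context.getConfiguration().get("my.custom.property", "default");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use 'threshold' while processing each record ...
    }
}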
Note that the Context (and therefore the Configuration) is also accessible in the setup and cleanup methods, which is useful for doing some initialization if needed.
Also it's worth mentioning you could probably directly call the addResource method from the Configuration object to add your properties directly as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so that might just be overkill.
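If you did go that route, a small sketch might look like the following; the file path /etc/hadoop/conf/custom-site.xml and the property key are invented, and the file would have to follow the standard Hadoop <configuration><property> XML layout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AddResourceExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load an extra XML config file on top of the defaults.
        conf.addResource(new Path("/etc/hadoop/conf/custom-site.xml"));
        System.out.println(conf.get("my.custom.property"));
    }
}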
In the case of non-String objects, I would advise using serialization: you can serialize your objects, convert them to Strings (probably encoding them, for example with Base64, since I'm not sure what would happen if you have unusual characters), and then on the mapper/reducer side de-serialize the objects from the Strings you get from the properties inside the Configuration.
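A hedged sketch of that idea, using plain Java serialization plus java.util.Base64; the helper class and the property key you would pass to it are made up for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;

public class ConfObjectCodec {

    // Driver side: serialize the object and store it as a Base64 string.
    public static void put(Configuration conf, String key, Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        conf.set(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Mapper/reducer side: decode the string and rebuild the object.
    public static Object get(Configuration conf, String key) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(conf.get(key));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}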
Another approach would be to use the same serialization technique, but write the serialized objects to HDFS instead and then add those files to the DistributedCache. It sounds a bit like overkill, but it would probably work.
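A rough sketch of the read side of that approach, assuming the newer mapreduce API where the driver registers the file with job.addCacheFile and the task reads it back via context.getCacheFiles(); the HDFS path and class names are illustrative only.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheBackedMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Object params;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver registered the serialized object earlier with something like
        // job.addCacheFile(new URI("hdfs:///user/hadoopexam/params.ser"));
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (ObjectInputStream in = new ObjectInputStream(fs.open(new Path(cacheFiles[0])))) {
            params = in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}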
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
Question : In the word count MapReduce algorithm, why might using a Combiner (which runs after the Mapper and before the Reducer) reduce the overall job running time? 1. Combiners perform local filtering of repeated words, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers. 2. Combiners perform global aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
Explanation: Combiner: The pipeline shown earlier omits a processing step which can be used for optimizing bandwidth usage by your MapReduce job. Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional. If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.

Word count is a prime example of where a Combiner is useful. The Word Count program in listings 1--3 emits a (word, 1) pair for every instance of every word it sees. So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer. By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer. Now each node only sends a single value to the reducer for each word -- drastically reducing the total bandwidth required for the shuffle process, and speeding up the job. The best part of all is that we do not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well. You can enable combining in the word count program by adding the following line to the driver:
conf.setCombinerClass(Reduce.class);

The Combiner should be an instance of the Reducer interface. If your Reducer itself cannot be used directly as a Combiner because it is not commutative and associative, you might still be able to write a third class to use as a Combiner for your job. The only effect a combiner has is to reduce the number of records that are passed from the mappers to the reducers in the shuffle and sort phase. For more information on combiners, see chapter 2 of Hadoop: The Definitive Guide, 3rd Edition, in the Scaling Out: Combiner Functions section.
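As a concrete illustration, here is the familiar word-count pattern with the reducer reused as the combiner, written against the newer org.apache.hadoop.mapreduce API as a sketch; the class names are placeholders, and the older JobConf-based call shown above is the equivalent in the old API.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emits (word, 1) for every occurrence
            }
        }
    }

    // Summation is commutative and associative, so this class works as both
    // the Reducer and the Combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}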
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
1. Increase the block size on all current files in HDFS. 2. Increase the block size on your remaining files. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Increase the amount of memory for the NameNode. 5. Increase the number of disks (or size) for the NameNode.
1. It returns a reference to a different Writable object each time. 2. It returns a reference to a Writable object from an object pool. 3. Access Mostly Uused Products by 50000+ Subscribers 4. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object. 5. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Question : For each input key-value pair, mappers can emit: 1. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous). 2. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair. 3. Access Mostly Uused Products by 50000+ Subscribers 4. One intermediate key-value pair, but of the same type. 5. As many intermediate key-value pairs as designed, as long as all the keys have the same types and all the values have the same type.