Question : What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
1. You will not be able to compress the intermediate data.
2. You will no longer be able to take advantage of a Combiner.
3. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
4. There are no concerns with this approach. It is always advisable to use multiple reducers.
Correct Answer : 3
Explanation: Multiple reducers and total ordering. If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you have used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you cannot simply concatenate your output files to create a single globally sorted output file. To do this you need a total ordering across the reducer partitions, which is what TotalOrderPartitioner provides.
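For reference, here is a minimal, hedged sketch of how a driver could swap the default HashPartitioner for TotalOrderPartitioner so that multiple reducers still yield globally sorted, concatenatable output. Choices such as KeyValueTextInputFormat, the reducer count, and the partition-file location are illustrative assumptions, not part of the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total sort");
        job.setJarByClass(TotalSortDriver.class);

        // Identity map/reduce is enough for a sort; keys and values are Text.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4); // multiple reducers, yet globally sorted output

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Replace the default HashPartitioner with TotalOrderPartitioner,
        // which routes whole key ranges to reducers based on a partition file.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[1] + "_partitions.lst")); // illustrative location

        // Sample the input to choose the split points written to that file.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The sampled partition file tells each reducer which key range it owns, so the output files part-r-00000 through part-r-00003 can simply be concatenated in order to obtain a single globally sorted result.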
Question : You wrote a map function that throws a runtime exception when it encounters a control character in the input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
1. You will have forty-eight failed task attempts
2. You will have seventeen failed task attempts
3. You will have five failed task attempts
4. You will have twelve failed task attempts
5. You will have twenty failed task attempts
Correct Answer : 5
Explanation: Every attempt on every split encounters a control character and throws the runtime exception, so each of the five map tasks is retried until it exhausts mapred.max.map.attempts. That means four failed task attempts for each of the five file splits: 5 splits x 4 attempts = 20 failed task attempts, after which the job fails.
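As a small, hedged illustration of where that limit lives (the property name is standard Hadoop configuration; the class name here is made up for the example), the per-task retry budget can be set programmatically before the job is submitted:

import org.apache.hadoop.conf.Configuration;

public class MaxAttemptsExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // mapred.max.map.attempts is the older name used in the question;
        // recent Hadoop releases treat it as a deprecated alias of
        // mapreduce.map.maxattempts.
        conf.setInt("mapred.max.map.attempts", 4);

        // Every attempt on every split throws the runtime exception, so each
        // of the 5 map tasks burns all 4 attempts: 5 x 4 = 20 failed attempts.
        System.out.println("max map attempts = "
                + conf.getInt("mapred.max.map.attempts", 4));
    }
}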
Question : To process input key-value pairs, your mapper needs to load an MB-sized data file into memory. What is the best way to accomplish this?
1. Serialize the data file, insert it into the JobConf object, and read the data into memory in the setup method of the mapper.
2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
3. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
4. Place the data file in the DistributedCache and read the data into memory in the setup method of the mapper.
Correct Answer : 4
Explanation: DistributedCache (deprecated as a standalone class, though the mechanism survives in the Job API) is a facility provided by the MapReduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on every data node (on the local file system, not in memory) where your map/reduce tasks run. You can then access the cached file as a local file in your Mapper or Reducer, read it once in setup, and populate a collection (e.g. an array or HashMap) in your code.
The new API call to retrieve cached files is context.getCacheFiles(), while context.getLocalCacheFiles() is deprecated. Note that context.getCacheFiles() returns a URI array in which each element is an HDFS path (hdfs://master:50011/user/hadoop/...), whereas context.getLocalCacheFiles() returns a Path array in which each element is a local path (//hadoop/localdirs/usercache/...).
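As a concrete, hedged sketch of the approach in option 4 (the file path /user/hadoop/lookup.dat, the symlink name lookup, and the tab-separated file format are illustrative assumptions, not part of the question): the driver adds the file to the cache with a #lookup fragment so each task sees it as a local symlink, and the mapper loads it into memory once in setup rather than on every map call:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    // Called from the driver: the "#lookup" fragment makes the framework
    // create a symlink named "lookup" in each task's working directory.
    public static void addCacheFile(Job job) throws Exception {
        job.addCacheFile(new URI("/user/hadoop/lookup.dat#lookup")); // illustrative path
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the cached file once, before any map() calls, into memory.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Enrich each record using the in-memory lookup built in setup().
        context.write(value, new Text(lookup.getOrDefault(value.toString(), "UNKNOWN")));
    }
}

Loading the file in setup means it is read once per map task instead of once per record, which is why option 4 is preferable to option 2.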