Question : What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
1. You will not be able to compress the intermediate data.
2. You will no longer be able to take advantage of a Combiner.
3. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
4. There are no concerns with this approach. It is always advisable to use multiple reducers.
Correct Answer : 3
Explanation: Multiple reducers and total ordering. If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you have used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you cannot simply concatenate your output files to create a single globally sorted output file. To do this you need a total ordering across the reducer partitions, which is what TotalOrderPartitioner provides.
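For reference, here is a minimal, hedged sketch of how a driver could swap the default HashPartitioner for TotalOrderPartitioner so that multiple reducers still yield globally sorted, concatenatable output. Choices such as KeyValueTextInputFormat, the reducer count, and the partition-file location are illustrative assumptions, not part of the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total sort");
        job.setJarByClass(TotalSortDriver.class);

        // Identity map/reduce is enough for a sort; keys and values are Text.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4); // multiple reducers, yet globally sorted output

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Replace the default HashPartitioner with TotalOrderPartitioner,
        // which routes whole key ranges to reducers based on a partition file.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[1] + "_partitions.lst")); // illustrative location

        // Sample the input to choose the split points written to that file.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The sampled partition file tells each reducer which key range it owns, so the output files part-r-00000 through part-r-00003 can simply be concatenated in order to obtain a single globally sorted result.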
Question : You wrote a map function that throws a runtime exception when it encounters a control character in the input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
1. You will have forty-eight failed task attempts
2. You will have seventeen failed task attempts
3. You will have five failed task attempts
4. You will have twelve failed task attempts
5. You will have twenty failed task attempts
Correct Answer : 5
Explanation: Every attempt on every split encounters a control character and throws the runtime exception, so each of the five map tasks is retried until it exhausts mapred.max.map.attempts. That means four failed task attempts for each of the five file splits: 5 splits x 4 attempts = 20 failed task attempts, after which the job fails.
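As a small, hedged illustration of where that limit lives (the property name is standard Hadoop configuration; the class name here is made up for the example), the per-task retry budget can be set programmatically before the job is submitted:

import org.apache.hadoop.conf.Configuration;

public class MaxAttemptsExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // mapred.max.map.attempts is the older name used in the question;
        // recent Hadoop releases treat it as a deprecated alias of
        // mapreduce.map.maxattempts.
        conf.setInt("mapred.max.map.attempts", 4);

        // Every attempt on every split throws the runtime exception, so each
        // of the 5 map tasks burns all 4 attempts: 5 x 4 = 20 failed attempts.
        System.out.println("max map attempts = "
                + conf.getInt("mapred.max.map.attempts", 4));
    }
}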
Question : To process input key-value pairs, your mapper needs to load an MB-sized data file into memory. What is the best way to accomplish this?
1. Serialize the data file, insert it into the JobConf object, and read the data into memory in the setup method of the mapper.
2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
3. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
4. Place the data file in the DistributedCache and read the data into memory in the setup method of the mapper.
Correct Answer : 4
Explanation: DistributedCache (deprecated as a standalone class, though the mechanism survives in the Job API) is a facility provided by the MapReduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on every data node (on the local file system, not in memory) where your map/reduce tasks run. You can then access the cached file as a local file in your Mapper or Reducer, read it once in setup, and populate a collection (e.g. an array or HashMap) in your code.
The new API call to retrieve cached files is context.getCacheFiles(), while context.getLocalCacheFiles() is deprecated. Note that context.getCacheFiles() returns a URI array in which each element is an HDFS path (hdfs://master:50011/user/hadoop/...), whereas context.getLocalCacheFiles() returns a Path array in which each element is a local path (//hadoop/localdirs/usercache/...).
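As a concrete, hedged sketch of the approach in option 4 (the file path /user/hadoop/lookup.dat, the symlink name lookup, and the tab-separated file format are illustrative assumptions, not part of the question): the driver adds the file to the cache with a #lookup fragment so each task sees it as a local symlink, and the mapper loads it into memory once in setup rather than on every map call:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    // Called from the driver: the "#lookup" fragment makes the framework
    // create a symlink named "lookup" in each task's working directory.
    public static void addCacheFile(Job job) throws Exception {
        job.addCacheFile(new URI("/user/hadoop/lookup.dat#lookup")); // illustrative path
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the cached file once, before any map() calls, into memory.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Enrich each record using the in-memory lookup built in setup().
        context.write(value, new Text(lookup.getOrDefault(value.toString(), "UNKNOWN")));
    }
}

Loading the file in setup means it is read once per map task instead of once per record, which is why option 4 is preferable to option 2.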