
MapR (HP) Hadoop Developer Certification Questions and Answers (Dumps and Practice Questions)



Question : You write a MapReduce job to process files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
1. There is no difference in output between the two settings.

2. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.

3. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.

4. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.


Correct Answer : 4


Explanation: It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.

Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
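
As a rough illustration (the class names, regex, and output types below are placeholders of my own, not taken from the question), a map-only job in the org.apache.hadoop.mapreduce API can be configured like this:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyGrepJob {

    // Illustrative mapper: emits (matched text, byte offset) for every regex match.
    // (The question also puts the filename in the value; that is omitted here for brevity.)
    public static class RegexMatchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Pattern pattern = Pattern.compile("\\d+");  // placeholder pattern

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Matcher m = pattern.matcher(line.toString());
            while (m.find()) {
                context.write(new Text(m.group()), offset);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only grep");
        job.setJarByClass(MapOnlyGrepJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(RegexMatchMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Zero reducers: map output goes straight to HDFS, unsorted,
        // one part-m-xxxxx file per map task (same effect as mapreduce.job.reduces=0).
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}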

Note: Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
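
A reducer skeleton using that classic (org.apache.hadoop.mapred) signature might look like the following; the Text/IntWritable types and the summing logic are illustrative assumptions, not part of the question:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer skeleton matching the reduce(WritableComparable, Iterator, OutputCollector, Reporter)
// signature described above.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();               // aggregate all values for this key
        }
        reporter.progress();                          // tell the framework the task is alive
        output.collect(key, new IntWritable(sum));    // write the final (key, sum) pair
    }
}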




Question : In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

1. Increase the parameter that controls minimum split size in the job configuration.

2. Write a custom MapRunner that iterates over all key-value pairs in the entire file.

3. Set the number of mappers equal to the number of input files you want to process.

4. Write a custom FileInputFormat and override the method isSplitable to always return false.

Correct Answer : 4


Explanation: FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
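
A minimal sketch of such a subclass (the class name and the choice of record reader are my own assumptions) could be:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A non-splittable input format: each input file becomes exactly one split,
// so exactly one map task, regardless of how many HDFS blocks the file spans.
public class WholeFileAsOneSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split a file into multiple input splits
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader();  // still read records line by line
    }
}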





Question : What is the term for the process of moving map outputs to the reducers?

1. Reducing

2. Combining

3. Partitioning

4. Shuffling and sorting

Correct Answer : 4


Explanation: Shuffling is the process of transferring data from the mappers to the reducers. It is necessary because without it the reducers would have no input (or at least no input from every mapper).

Sorting saves time for the reducer by making it easy to tell when a new reduce call should begin: the reducer simply starts a new reduce call when the next key in the sorted input differs from the previous one.

Partitioning, offered as one of the answer choices, is a different process. It determines to which reducer a (key, value) pair output by the map phase will be sent.

Note that a reducer is different from a reduce task: a reducer can run multiple reduce tasks. Note also that the merging and sorting of the shuffled data are performed locally by each reducer on its own input data, whereas partitioning is not local.
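
For illustration, the hash-style partitioning logic described above can be expressed as a custom Partitioner; the Text/IntWritable types here are assumptions for the example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The partition number returned here decides which reducer receives
// a given map-output (key, value) pair.
public class HashStylePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit, then map the key's hash onto [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}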


Related Questions


Question : Which one of the following is NOT a valid Oozie action?

1. mapreduce

2. pig

3. hive

4. mrunit



Question : You want to count the number of occurrences of each unique word in the supplied input data. You have decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize it by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case, and why or why not? (A driver sketch follows the answer choices.)

1. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.

2. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.

3. No, because the Reducer and Combiner are separate interfaces.

4. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value.

5. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.
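
For reference, here is a minimal word-count job of the shape described in the question, with the summing reducer wired in as the combiner via setCombinerClass; all class names are illustrative placeholders, not from the question:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE);     // emit (word, 1) for every token
            }
        }
    }

    // Summing is associative and commutative, and its input and output types match
    // ((Text, IntWritable) in and out), which is what makes reuse as a combiner possible.
    public static class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizingMapper.class);
        job.setCombinerClass(SummingReducer.class);   // combiner reuses the reducer class
        job.setReducerClass(SummingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}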


Question : Workflows expressed in Oozie can contain:
1. Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins.

2. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.

3. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.

4. Iterative repetition of MapReduce jobs until a desired answer or state is reached.


Question : Your cluster's HDFS block size is __ MB. You have a directory containing __ plain text files, each of which is __ MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.

1. 64

2. 100

3. 200

4. 640


Question : Which process describes the lifecycle of a Mapper?

1. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.

2. The TaskTracker spawns a new Mapper to process all records in a single input split.

3. The TaskTracker spawns a new Mapper to process each key-value pair.

4. The JobTracker spawns a new Mapper to process all records in a single file.


Question : What is a SequenceFile?


1. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.

2. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.

3. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.

4. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
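
For background, a SequenceFile is written with one fixed key class and one fixed value class; a minimal writing sketch (the path, types, and data are illustrative) might be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq");   // illustrative output path

        // The key class and value class are fixed for the whole file when the writer is created.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class));
        try {
            for (int i = 0; i < 5; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));  // one binary key-value record
            }
        } finally {
            writer.close();
        }
    }
}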