
MapR (HP) Hadoop Developer Certification Questions and Answers (Dumps and Practice Questions)



Question : Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is
TextInputFormat. Determine how many Mappers will run.

1. 64

2. 100

3. 200

4. 640

Correct Answer : 3


Explanation: Each file is split into two because the block size (64 MB) is less than the file size (100 MB), so 200 mappers will run.
Note: If you are not compressing the files, Hadoop processes a large file (say 10 GB) with a number of mappers determined by the block size of the
file. With a block size of
64 MB, you will have ~160 mappers processing that 10 GB file (160 * 64 MB ≈ 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an
acceptable block size, but if
you find that your mappers are executing in sub-minute times, you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB - the
actual size depends on how you intend to process the data).
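
As a quick sanity check, here is a small sketch of the arithmetic above (assuming the default FileInputFormat behaviour where the split size equals the HDFS block size; the class name and constants are illustrative only):

// Estimate the number of map tasks for 100 files of 100 MB each on a
// cluster with a 64 MB HDFS block size, assuming split size == block size.
public class MapperCountEstimate {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB
        long fileSize  = 100L * 1024 * 1024;  // 100 MB per file
        int  numFiles  = 100;

        // Each file is split independently: ceil(100 MB / 64 MB) = 2 splits per file.
        long splitsPerFile = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Expected mappers: " + splitsPerFile * numFiles); // prints 200
    }
}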




Question : Which process describes the lifecycle of a Mapper?

1. The JobTracker calls the TaskTracker's configure () method, then its map () method and finally its close () method.

2. The TaskTracker spawns a new Mapper to process all records in a single input split.

3. The TaskTracker spawns a new Mapper to process each key-value pair.

4. The JobTracker spawns a new Mapper to process all records in a single file.

Correct Answer : 2


Explanation: For each map task that runs, the TaskTracker creates a new instance of your Mapper.
Note:
* The Mapper is responsible for processing key/value pairs obtained from the InputFormat. The Mapper may perform a number of extraction and transformation functions on the key/value pair before ultimately outputting none, one, or many key/value pairs of the same or a different key/value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'identity' map function by default - every input key/value pair obtained from the InputFormat is written out.
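
A minimal sketch of that lifecycle using the new org.apache.hadoop.mapreduce API (the class name and key/value types here are illustrative, not taken from the question):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One instance is created per input split: the framework calls setup() once,
// then map() for every record in the split, and finally cleanup().
public class LifecycleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // One-time initialization for this task attempt (e.g. load side data).
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per key/value pair; pass-through, like the default identity map.
        context.write(key, value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // One-time teardown after the last record of the split has been processed.
    }
}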




Question : What is a SequenceFile?


1. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.

2. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.

3. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.

4. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.


Correct Answer : 4


Explanation: A SequenceFile is a flat file consisting of binary key/value pairs. There are three different SequenceFile formats:
Uncompressed key/value records.
Record-compressed key/value records - only the values are compressed.
Block-compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
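
For illustration, a small sketch of writing such a file with homogeneous key and value types; the output path and records are placeholders, and the option-based SequenceFile.createWriter API from Hadoop 2 is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Appends a few key/value pairs to a SequenceFile; every key is a Text and
// every value an IntWritable, matching option 4 above.
public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq");  // hypothetical output path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class));
            for (int i = 0; i < 3; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}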


Related Questions


Question : A combiner reduces:

1. The number of values across different keys in the iterator supplied to a single reduce method call.

2. The amount of intermediate data that must be transferred between the mapper and reducer.

3. The number of input files a mapper must process.

4. The number of output files a reducer must produce.


Question : Which two of the following statements are true about HDFS? Choose two answers.

A. An HDFS file that is larger than dfs.blocksize is split into blocks
B. Blocks are replicated to multiple datanodes
C. HDFS works best when storing a large number of relatively small files
D. Block sizes for all files must be the same size
1. A,B
2. B,C
3. C,D
4. A,D


Question : You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the
DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and
populating the associative array?

1. combine

2. map

3. init

4. configure


Question : What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

1. You will not be able to compress the intermediate data.

2. You will no longer be able to take advantage of a Combiner.

3. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.

4. There are no concerns with this approach. It is always advisable to use multiple reducers.


Question : You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such
characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number
of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:

1. You will have forty-eight failed task attempts

2. You will have seventeen failed task attempts

3. You will have five failed task attempts

4. You will have twelve failed task attempts

5. You will have twenty failed task attempts


Question : To process input key-value pairs, your mapper needs to load a MB data file in memory. What is the best way to accomplish this?

1. Serialize the data file, insert it in the JobConf object, and read the data into memory in the setup method of the mapper.

2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.

3. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.

4. Place the data file in the DistributedCache and read the data into memory in the setup method of the mapper.