Question : Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. How many Mappers will run?
1. 64
2. 100
3. 200
4. 640
Correct Answer : 3
Explanation: Each 100 MB file spans two 64 MB blocks, so each file is split in two; 100 files * 2 splits per file = 200 mappers. Note: if you are not compressing the files, Hadoop will process a large file (say 10 GB) with a number of mappers determined by the block size. With a 64 MB block size, you will have ~160 mappers processing that 10 GB file (160 * 64 MB ~= 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers finish in sub-minute times, you might want to increase the work done by each mapper by raising the block size to 128, 256, or 512 MB; the right size depends on how you intend to process the data.
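Note: the split arithmetic can be sanity-checked with a few lines of Java. This is a minimal sketch (the class and method names are illustrative), assuming the common case where the split size equals the HDFS block size (TextInputFormat, uncompressed files):

    public class SplitEstimate {
        // Estimated map tasks for numFiles files of fileSizeMb each,
        // when each file is split at blockSizeMb boundaries.
        static long estimateMappers(long numFiles, long fileSizeMb, long blockSizeMb) {
            long splitsPerFile = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
            return numFiles * splitsPerFile;
        }

        public static void main(String[] args) {
            // 100 files * ceil(100 / 64) = 100 * 2 = 200
            System.out.println(estimateMappers(100, 100, 64));
        }
    }

Running this prints 200, matching answer 3 above.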
Question : Which process describes the lifecycle of a Mapper?
1. The JobTracker calls the TaskTracker's configure() method, then its map() method, and finally its close() method.
2. The TaskTracker spawns a new Mapper to process all records in a single input split.
3. The TaskTracker spawns a new Mapper to process each key-value pair.
4. The JobTracker spawns a new Mapper to process all records in a single file.
Correct Answer : 2
Explanation: For each map task that runs, the TaskTracker creates a new instance of your Mapper. Note:
* The Mapper is responsible for processing key/value pairs obtained from the InputFormat. It may perform a number of extraction and transformation functions on each pair before ultimately outputting zero, one, or many key/value pairs of the same or a different type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'identity' map function by default: every input key/value pair obtained from the InputFormat is written out unchanged.
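Note: the lifecycle is easiest to see in code. Below is a minimal new-API mapper sketch (identity behaviour, like the default map()); setup() and cleanup() run once per task, i.e. once per input split, while map() runs once per record:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One instance of this class is created per map task, i.e. per input split.
    public class LifecycleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void setup(Context context) {
            // Runs once, before the first record of the split is processed.
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Runs once per key/value pair; may emit zero, one, or many pairs.
            context.write(key, value); // identity, same as the default map()
        }

        @Override
        protected void cleanup(Context context) {
            // Runs once, after the last record of the split.
        }
    }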
Question : What is a SequenceFile?
1. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
2. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
3. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
4. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Correct Answer : 4
Explanation: A SequenceFile is a flat file consisting of binary key/value pairs. There are 3 different SequenceFile formats:
* Uncompressed key/value records.
* Record-compressed key/value records: only the values are compressed.
* Block-compressed key/value records: both keys and values are collected in 'blocks' separately and compressed; the size of the block is configurable.
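Note: a minimal sketch of writing a SequenceFile with the org.apache.hadoop.io.SequenceFile API (the output path and compression choice here are just for illustration). The key and value classes are fixed when the writer is created, which is what enforces the "same type" constraint from answer 4:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("demo.seq"); // illustrative output path

            // Every appended record must be an IntWritable key and a Text value;
            // BLOCK compression collects keys and values and compresses them per block.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (int i = 0; i < 5; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }

Appending a key or value of a different class than the one the writer was created with fails with an IOException, which is the "same type" guarantee in practice.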