Explanation: Spilling map output to disk multiple times (before the final spill) adds the overhead of reading and merging the spilled records. From "Hadoop: The Definitive Guide": [Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, which has the default 0.80, or 80%), a background thread will start to spill the contents to disk.]
If possible, eliminate all intermediate spills and spill only the final output.
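The effect of the buffer size on the number of spills can be estimated with some simple arithmetic. The sketch below is a simplified model, not Hadoop's actual accounting: it assumes the whole buffer holds record data (the real buffer also stores per-record metadata, so real spill counts can be higher), and the function name estimate_spills is hypothetical.

```python
import math

def estimate_spills(map_output_mb, sort_mb=100, spill_percent=0.80):
    """Rough number of spill files for one map task.

    Simplified model: a spill is written each time the buffer fills
    to sort_mb * spill_percent, plus a final flush of what remains.
    """
    if map_output_mb <= 0:
        return 0
    spill_size_mb = sort_mb * spill_percent
    return math.ceil(map_output_mb / spill_size_mb)

# With the default 100 MB buffer, 200 MB of map output spills about 3 times.
print(estimate_spills(200))
# Raising the buffer to 512 MB lets the same output fit in a single final spill.
print(estimate_spills(200, sort_mb=512))
```

Under this model, a map task avoids intermediate spills when its output is no larger than sort_mb * spill_percent.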
Question : Select the correct statement regarding the circular buffer and the spilling of its contents
1. When the circular buffer reaches 80% full (or whatever threshold is configured), its contents are first sorted by key; if a combiner is configured, it is also executed.
2. By default, up to 10 spill files are merged at a time once spilling has finished.
Explanation: Each map task outputs data in the form of key/value pairs.

mapreduce.tasktracker.map.tasks.maximum: 8
The maximum number of map tasks that will be run simultaneously by a task tracker.

mapreduce.map.memory.mb: 128
The amount of memory to request from the scheduler for each map task.

The output is stored in a Ring Buffer rather than being written directly to the disk. When the Ring Buffer reaches 80% capacity, the content is "spilled" to disk. This process will create multiple files on the datanode (shuffle spill files).

mapreduce.map.sort.spill.percent: 0.80
The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than 0.5.

Hadoop will merge all the spill files on a given datanode into a single file. This single file is both sorted and partitioned based on the number of reducers.

mapreduce.task.io.sort.mb: 512
The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1 MB, which should minimize seeks.

mapreduce.task.io.sort.factor: 64
The number of streams to merge at once while sorting files. This determines the number of open file handles.

mapreduce.reduce.shuffle.input.buffer.percent: 0.70
The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

mapreduce.reduce.input.buffer.percent: 0.70
The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

mapreduce.reduce.shuffle.parallelcopies: 128
The default number of parallel transfers run by reduce during the copy (shuffle) phase.

mapreduce.reduce.memory.mb: 1024
The amount of memory to request from the scheduler for each reduce task.

mapreduce.reduce.shuffle.merge.percent: 0.66
The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
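The sort-then-merge behaviour described above can be illustrated with a small in-memory sketch. This is not Hadoop code: the partition function is a stand-in for a hash partitioner, spill "files" are plain Python lists, and the names partition, make_spill and merge_spills are hypothetical.

```python
import heapq

def partition(key, num_reducers):
    # Stand-in for a hash partitioner; summing the bytes keeps the
    # result deterministic across runs (unlike Python's built-in hash).
    return sum(key.encode("utf-8")) % num_reducers

def make_spill(records, num_reducers):
    """Sort one buffer's worth of (key, value) records by partition, then key."""
    return sorted(records, key=lambda kv: (partition(kv[0], num_reducers), kv[0]))

def merge_spills(spills, num_reducers):
    """Merge already-sorted spills into one sorted, partitioned sequence."""
    order = lambda kv: (partition(kv[0], num_reducers), kv[0])
    return list(heapq.merge(*spills, key=order))

spills = [
    make_spill([("b", 1), ("a", 1)], num_reducers=2),
    make_spill([("a", 2), ("c", 1)], num_reducers=2),
]
merged = merge_spills(spills, num_reducers=2)
for key, value in merged:
    print(partition(key, 2), key, value)
```

Because every spill is already sorted, the merge only needs to compare the head of each spill, which is why the number of streams merged at once determines the number of open file handles.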
Circular Buffer
The Ring Buffer (aka Circular Buffer) is a key concept in the MapReduce ecosystem.
We have two major challenges in any map/reduce program:
1. We are dealing with a massive amount of data. If this isn't true, we don't need to use map/reduce.
2. The result of the map tasks cannot be constantly written to disk, because this would be too slow. Nor can it be stored entirely within memory, because most systems would not have a sufficient amount of memory.
We have to use a combination of disks/memory efficiently.
The circular buffer is fast. Writing to memory is much faster than doing an I/O to disk. Flushing the data is only performed when needed.
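A minimal sketch of a circular buffer that flushes only when a threshold is reached is shown below. It is an illustration under stated assumptions, not Hadoop's implementation: the class name SpillingBuffer is hypothetical, capacity is counted in records rather than bytes, spills are kept as in-memory lists, and the spill runs synchronously, whereas Hadoop spills from a background thread.

```python
class SpillingBuffer:
    """Fixed-size ring buffer that spills its contents at a fill threshold."""

    def __init__(self, capacity, spill_percent=0.80):
        self.capacity = capacity
        self.threshold = max(1, int(capacity * spill_percent))
        self.slots = [None] * capacity
        self.head = 0    # next slot to write
        self.tail = 0    # oldest record not yet spilled
        self.count = 0   # records currently held
        self.spills = [] # stand-in for spill files on disk

    def write(self, record):
        self.slots[self.head] = record
        self.head = (self.head + 1) % self.capacity  # wrap around
        self.count += 1
        # Spilling synchronously at the threshold means the buffer
        # never overflows in this simplified model.
        if self.count >= self.threshold:
            self.spill()

    def spill(self):
        batch = []
        while self.count:
            batch.append(self.slots[self.tail])
            self.tail = (self.tail + 1) % self.capacity
            self.count -= 1
        if batch:
            self.spills.append(sorted(batch))  # each spill is sorted

    def close(self):
        self.spill()  # final flush of whatever remains

buf = SpillingBuffer(capacity=10)
for record in range(20, 0, -1):
    buf.write(record)
buf.close()
print([len(s) for s in buf.spills])
```

Writes touch only memory until the threshold is crossed, and the head and tail indices wrap around so the same slots are reused after each spill.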
Continuous logging can fill up space on the systems, causing other programs to also run out of space and fail. In such cases, either logs have to be manually removed or a log rotation policy has to be implemented.
Question : Please map the following
A. mapred.map.child.java.opts
B. mapred.reduce.child.java.opts
C. mapred.child.java.opts
D. mapred.child.ulimit