Most scripting languages work: PHP, Python, Perl, Ruby, and Bash are all fine. Any language that can read from stdin, write to stdout, and parse tab and newline characters will do: Hadoop Streaming simply pipes the string representations of key/value pairs, concatenated with a tab, to an arbitrary program that must be executable on each TaskTracker node.
Java is, of course, supported natively.
Heartbeats of child nodes will not be sent to the parent nodes when scripting languages are used.
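To make the Streaming contract concrete, the sketch below is a minimal word-count mapper that does nothing Hadoop-specific: it reads lines from stdin and emits "word<TAB>1" pairs on stdout. It is written in Java only so that all examples here stay in one language (a real Java job would normally use the native MapReduce API), and the class name StreamingWordCountMapper is just an illustrative choice; the equivalent program in PHP, Python, Perl, Ruby, or Bash satisfies the same contract.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal mapper honouring the Hadoop Streaming contract:
// read input lines from stdin, write "key<TAB>value" lines to stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // One record per word; Streaming splits key and value on the tab.
                    System.out.println(word + "\t" + 1);
                }
            }
        }
    }
}

Such an executable (or its scripting-language equivalent) is handed to the hadoop-streaming jar through the -mapper option, alongside -input, -output, and optionally -reducer; the framework feeds it the input splits and collects its tab-separated output on every TaskTracker node.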
Question :
How does Hadoop process large volumes of data?
1. Hadoop uses a lot of machines in parallel. This optimizes data processing.
2. Hadoop was specifically designed to process large amounts of data by taking advantage of MPP hardware.
3. Hadoop ships the code to the data instead of sending the data to the code.
4. Hadoop uses sophisticated caching techniques on the NameNode to speed up the processing of data.
The basic design principle of Hadoop is to eliminate data copying between different DataNodes: the computation is shipped to the node that already holds the data, as the sketch below illustrates.
Refer to HadoopExam.com Recorded Training Modules 2 and 3.
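To see the "ship the code to the data" principle in runnable form, here is a sketch of the classic WordCount job, closely modeled on the standard example from the Hadoop documentation. The driver packages the Mapper and Reducer into the job jar; the framework then distributes that jar across the cluster and schedules each map task on, or as close as possible to, the DataNode holding its input split. The input and output paths are taken from the command line and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task runs on (or near) the node that stores its input split,
    // so the code travels to the data rather than the data to the code.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the per-word counts produced by the mappers running in parallel.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);        // this jar is shipped to the task nodes
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitting it with hadoop jar moves only the jar to the cluster; the input blocks stay where HDFS already stores them.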
Question :
What are sequence files and why are they important?
1. Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance MapReduce jobs.
2. Sequence files are a type of file in the Hadoop framework that allow data to be sorted.
3. Sequence files are intermediate files that are created by Hadoop after the map step.
4. All of the above.
Explanation: Hadoop is able to split data between different nodes gracefully while keeping the data compressed. Sequence files contain special markers that allow the data to be split across the entire cluster.
The SequenceFile format supported by Hadoop breaks a file into blocks and then, optionally, compresses the blocks in a splittable way.
It is also worth noting that, internally, the temporary outputs of maps are stored using the SequenceFile format. SequenceFile provides Writer, Reader, and Sorter classes for writing, reading, and sorting, respectively.
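A short sketch of the Writer and Reader mentioned above, using the option-based SequenceFile API available in Hadoop 2 and later: it writes a hundred IntWritable/Text records with block compression (the sync markers written alongside the blocks are what keep the file splittable) and then reads them back. The output path demo.seq and the record contents are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq");   // illustrative output path

        // Write key/value records; BLOCK compression compresses batches of
        // records together while the file stays splittable via sync markers.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }

        // Read the records back in order; a map task could just as well start
        // reading from the nearest sync point inside its own split.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key.get() + "\t" + value);
            }
        }
    }
}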
1. Hadoop attempts to run Mappers on the nodes which hold their portion of the data locally.
2. Multiple Mappers run in parallel.
3. Access Mostly Uused Products by 50000+ Subscribers
4. 1 and 2 are correct
5. 1, 2 and 3 are correct