Question : Because of an OutOfMemoryError, a Map or Reduce task can crash. How does Hadoop MapReduce v1 (MRv1) handle JVMs when a new MapReduce job is started on a cluster? 1. The TaskTracker may or may not use the same JVM for each task it manages on that node 2. The TaskTracker reuses the same JVM for each task it manages on that node 3. Access Mostly Uused Products by 50000+ Subscribers 4. The TaskTracker spawns a new JVM for each task it manages on that node
Explanation: A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.
The TaskTracker spawns a separate JVM process for each task to do the actual work; this is to ensure that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. After a TaskTracker has been assigned a task, it creates a TaskRunner instance to manage the task. The TaskRunner prepares the environment and launches a new JVM for the task. This process happens for every task assigned to a TaskTracker.
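In MRv1 the per-task JVM behaviour can be tuned with the mapred.job.reuse.jvm.num.tasks property; the default of 1 gives exactly the behaviour described above, one fresh JVM per task. The minimal old-API driver below is only a sketch of where that knob lives (the class name and the input/output paths are illustrative, not from the question):

```java
// Minimal MRv1 driver sketch showing the JVM-reuse setting. By default
// (value 1) the TaskTracker launches a fresh child JVM per task, which is
// what confines an OutOfMemoryError to a single task attempt.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JvmReuseDriver.class);
        conf.setJobName("jvm-reuse-demo");

        //  1 = spawn a new JVM for every task (the default behaviour)
        // -1 = reuse the same JVM for an unlimited number of tasks of one job
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

Setting the property to -1 would let many tasks of the same job share one JVM, which trades the isolation described in the explanation for lower JVM start-up cost.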
Watch the module : 3 to 8 from http://hadoopexam.com/index.html/#hadoop-training
Question : You have configured a Hadoop cluster with MapReduce. You have a directory called HadoopExam in HDFS containing two files: Exam1 and Exam2. You submit a job to the cluster, using that directory as the input directory. A few seconds after you have submitted the job, a user starts copying a large file, Exam3, into the directory. Select the correct statement. 1. All files Exam1, Exam2 and Exam3 will be processed by the job 2. Only files Exam1 and Exam2 will be processed by the job 3. Access Mostly Uused Products by 50000+ Subscribers 4. Only file Exam3 will be processed by the job
Explanation: The way HDFS has been set up, it breaks down very large files into large blocks (for example, measuring 128MB), and stores three copies of these blocks on different nodes in the cluster. HDFS has no awareness of the content of these files. In YARN, when a MapReduce job is started, the Resource Manager (the cluster resource management and job scheduling facility) creates an Application Master daemon to look after the lifecycle of the job. (In Hadoop 1, the JobTracker monitored individual jobs as well as handling job scheduling and cluster resource management.) One of the first things the Application Master does is determine which file blocks are needed for processing. The Application Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application Master makes requests to the Resource Manager to have map tasks process specific blocks on the slave nodes where they're stored. The key to efficient MapReduce processing is that, wherever possible, data is processed locally - on the slave node where it's stored.
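To make the locality idea concrete, the short client-side sketch below asks the NameNode for the block locations of one input file through the public FileSystem API; this is the same kind of metadata the Application Master (or the JobTracker in Hadoop 1) works from when it places map tasks. The HDFS path is an assumed example, not part of the question:

```java
// Client-side sketch of the block-location lookup described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file used only for illustration.
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/HadoopExam/Exam1"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each BlockLocation lists the DataNodes holding one block's replicas,
        // which is exactly the locality information used when scheduling map tasks.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```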
Before looking at how the data blocks are processed, you need to look more closely at how Hadoop stores data. In Hadoop, files are composed of individual records, which are ultimately processed one-by-one by mapper tasks. For example, the sample data set contains information about completed flights within the United States between 1987 and 2008. To download the sample data set, open the Firefox browser from within the VM, and go to the dataexpo page. You have one large file for each year, and within every file, each individual line represents a single flight. In other words, one line represents one record. Now, remember that the block size for the Hadoop cluster is 64MB, which means that the flight data files are broken into chunks of exactly 64MB. Do you see the problem? If each map task processes all records in a specific data block, what happens to those records that span block boundaries? File blocks are exactly 64MB (or whatever you set the block size to be), and because HDFS has no conception of what's inside the file blocks, it can't gauge when a record might spill over into another block. To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends. In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record. When a job is submitted, the client calculates the InputSplits and passes that information to the JobTracker. The JobTracker then creates Map tasks to process each of those InputSplits. Any more data added to the directory as the job is running will be ignored, as the JobTracker never re-scans the directory. For details on the anatomy of a MapReduce job run, see Hadoop: The Definitive Guide, 3rd Edition, page 190.
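The scenario in the question can be pictured with a plain new-API driver like the sketch below; the point is only that FileInputFormat computes the splits from whatever files are in the HadoopExam directory at the moment waitForCompletion() submits the job, so Exam3, copied in afterwards, is never scanned. The class name and paths are illustrative, and the mapper and reducer are left as hypothetical placeholders:

```java
// Sketch of a job driver for the directory-input scenario above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopExamDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hadoopexam-directory-job");
        job.setJarByClass(HadoopExamDriver.class);

        // Hypothetical mapper/reducer classes would be configured here, e.g.:
        // job.setMapperClass(ExamMapper.class);
        // job.setReducerClass(ExamReducer.class);

        // Input splits are calculated from the files present in this
        // directory at submission time; files added later are ignored.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/HadoopExam"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/HadoopExam-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```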
Watch the module : 3 to 8 from http://hadoopexam.com/index.html/#hadoop-training
Question : As you know, a Hadoop cluster is made up of multiple nodes, and each file is divided into multiple blocks stored on different nodes. For this you need to be able to serialize your data, and you use the Writable interface for this; select the correct statement for the Writable interface. 1. Writable is a class that all keys and values in MapReduce must extend. Classes extending this interface must implement methods for serializing and deserializing themselves. 2. Writable is a class that all keys and values in MapReduce must extend. Classes extending this interface must not implement methods for serializing and deserializing themselves until they want to customize it. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Writable is an interface that all values in MapReduce must implement. Classes implementing this interface must implement methods for serializing and deserializing themselves.
Explanation: A Writable is a serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance. Writable is an interface that defines a way for objects to serialize themselves to and deserialize themselves from a stream in Hadoop. (It is roughly analogous to the java.io.Externalizable interface in the core Java API.) All keys and values emitted by map or reduce tasks must implement Writable. For more information, see the Yahoo! Developer Network Apache Hadoop Tutorial in the Custom Data Types section. What is a Writable in Hadoop? If you have gone through the "Hello World" of MapReduce post, or any other Hadoop program, you must have seen data types different from the regular Java data types. In the wordCount post, you must have seen LongWritable, IntWritable and Text. It is fairly easy to understand the relation between them and Java's primitive types: LongWritable is equivalent to long, IntWritable to int and Text to String. Any value in Hadoop must be Writable. Writable is an interface in Hadoop, and types used in Hadoop must implement this interface. Hadoop provides these writable wrappers for almost all Java primitive types and some other types. Now the obvious question is why Hadoop uses these types instead of Java types. Why does Hadoop use Writable(s)? As we already know, data needs to be transmitted between different nodes in a distributed computing environment. This requires serialization and deserialization of data to convert data in a structured format to a byte stream and vice-versa. Hadoop therefore uses a simple and efficient serialization protocol to serialize data between the map and reduce phases, and these types are called Writable(s). Some examples of writables, as already mentioned, are IntWritable, LongWritable, BooleanWritable and FloatWritable. The entire list is in the org.apache.hadoop.io package of the Hadoop source (http://hadoop.apache.org/docs/current/api/index.html).
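The write(DataOutput)/readFields(DataInput) contract described above can be seen in a few lines without a cluster; the round trip below is a minimal illustration using the built-in Text and IntWritable wrappers:

```java
// Standalone round trip showing the Writable contract: write(DataOutput)
// serializes an object, readFields(DataInput) repopulates one from bytes.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize a key/value pair to a byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new Text("hadoop").write(out);
        new IntWritable(42).write(out);
        out.close();

        // Deserialize them back from the same bytes.
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text word = new Text();
        IntWritable count = new IntWritable();
        word.readFields(in);
        count.readFields(in);

        System.out.println(word + " -> " + count.get());  // prints: hadoop -> 42
    }
}
```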
Limitation of the primitive Hadoop Writable classes: In the wordCount example we emit Text as the key and IntWritable as the value from the Mappers and Reducers. Although Hadoop provides many primitive Writables that can be used in simple applications like wordCount, these clearly cannot serve our purpose all the time.
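When the built-in wrappers are not enough, you implement Writable yourself. The class below is a hypothetical example for the flight data mentioned earlier (the name and fields are illustrative); note that a custom type used as a key, rather than a value, would additionally need to implement WritableComparable so that keys can be sorted:

```java
// Hypothetical custom value type implementing the Writable interface.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlightDelayWritable implements Writable {
    private int year;          // flight year, e.g. 1987..2008
    private double delayMins;  // total delay in minutes

    public FlightDelayWritable() { }  // no-arg constructor required by the framework

    public FlightDelayWritable(int year, double delayMins) {
        this.year = year;
        this.delayMins = delayMins;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order...
        out.writeInt(year);
        out.writeDouble(delayMins);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and read them back in exactly the same order.
        year = in.readInt();
        delayMins = in.readDouble();
    }

    public int getYear() { return year; }
    public double getDelayMins() { return delayMins; }

    @Override
    public String toString() {
        return year + "\t" + delayMins;
    }
}
```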
Watch the module : 10 from http://hadoopexam.com/index.html/#hadoop-training