Question : You are processing the MAIN.PROFILE.log file generated by the Apache web server of the QuickTechie.com website with a MapReduce job. There are 100 nodes in the cluster and 3 reducers defined. Which of the reduce tasks will process a Text key that matches the regular expression "\w+"?
1. First Reducer will process the key which satisfies the regular expression "\w+"
2. Second Reducer will process the key which satisfies the regular expression "\w+"
3. Access Mostly Uused Products by 50000+ Subscribers
4. Not enough data to determine which reduce task will receive which key
Correct Answer : 4
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records, and a given input pair may map to zero or many output pairs. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup. Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages, update Counters, or simply indicate that they are alive.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class). The Mapper outputs are sorted and then partitioned per Reducer; the total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner. Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer. The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are compressed, and which CompressionCodec is used, via the JobConf.
How many maps? The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute. Thus, if you expect 10TB of input data and have a block size of 128MB, you will end up with about 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
Reducer: a Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each (key, list of values) pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup. A Reducer has 3 primary phases: shuffle, sort and reduce.
Shuffle: the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP. Sort: the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged. Secondary sort: if the equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate a secondary sort on values.
Reduce: in this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each (key, list of values) pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages, update Counters, or simply indicate that they are alive. The output of the Reducer is not sorted.
How many reduces? The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes * mapred.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead but improves load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
Reducer NONE: it is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path), and the framework does not sort the map outputs before writing them out.
When there is more than one reducer, the map tasks partition their output among the reducers using a partitioning function. By default the partitioning function uses the hash code of the key to identify the partition, but it can be overridden with a user-defined partitioning function. Either way, there is not sufficient data to determine which reducer will process a key that merely matches "\w+", so the correct answer is option 4. For more information, see chapter 2 of Hadoop: The Definitive Guide, 3rd Edition, in the Scaling Out: Data Flow section.
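To see why the pattern of the key alone does not determine the reducer, here is a minimal plain-Java sketch of the default hash-partitioning rule, partition = (hashCode & Integer.MAX_VALUE) % numReduceTasks. It uses String.hashCode() for illustration (Hadoop's Text type hashes the key's UTF-8 bytes, but the principle is the same), and the class name and key values are made up for the example.

public class PartitionDemo {
    // Same formula the default hash partitioner applies to a key's hash code.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3; // as in the question
        for (String key : new String[] {"GET", "POST", "profile", "user42"}) {
            // Every one of these keys matches "\w+", yet the target reducer
            // depends on the exact bytes of the key, not on the pattern.
            System.out.printf("key=%-8s -> reducer %d%n", key, partitionFor(key, reducers));
        }
    }
}

With three reducers, keys that all match "\w+" land in different partitions, which is exactly why the question cannot be answered without knowing the concrete key.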
Question : To process the www.HadoopExam.com MAIN.PROFILE.log file, you submit a job to a cluster running on MRv. There are 1000 slave nodes spread across 100 racks, and you have NOT specified a rack topology script. Your job has a single Reducer, which runs on Node7 of Rack7. The output file it writes is small enough to fit in a single HDFS block. How does Hadoop handle writing the output file?
1. The first replica of the block will be stored on any node out of the 1000 nodes.
2. The first replica of the block will be stored on Node7 of Rack7 only. The other two replicas will be stored on other nodes in any rack.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The first replica of the block will be stored on Node7 in Rack7. The other two replicas will be stored on Node6 and Node8 in Rack7.
Explanation: How is it decided on which datanodes the replicas of a block will be kept?
There is a trade-off between reliability and read/write bandwidth here. Suppose we set the replication factor to 1: every block of the data is kept only once, on a single machine, and if that machine goes down we cannot retrieve the data. That is a reliability problem. On the other hand, keeping all replicas on a single node incurs the lowest write-bandwidth penalty, which is good.
Now suppose we set the replication factor to 5: even if one, two or three of the nodes go down we can still read the data blocks, and while reading there is a better chance of data locality between a datanode holding the data and the client, so retrieval will be fast. But the write bandwidth consumed will be higher, and the data redundancy stored will be greater.
So there is a trade-off between reliability and read/write bandwidth. In the vast majority of production deployments the replication factor is kept at three, which is suitable for most use cases.
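To make the discussion concrete, below is a small, hedged Java sketch of how a client can control the replication factor, using the standard org.apache.hadoop.fs API. The property name dfs.replication is the real HDFS setting for files created through this client; the file path passed to setReplication is hypothetical and only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for new files written by this client.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (path is hypothetical).
        boolean requested = fs.setReplication(new Path("/logs/MAIN.PROFILE.log"), (short) 3);
        System.out.println("re-replication requested: " + requested);
        fs.close();
    }
}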
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack. In other words, the first copy of any block created by a client running on the cluster (as the Reducer is in this case) will be written on the node running that client, and the other two replicas will be stored on other nodes in the cluster. Hence option 2 is correct.
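If you want to verify where HDFS actually placed the replicas of the reducer's output, a sketch along the following lines (assuming the standard FileSystem API; the output path is made up for the example) prints the hosts holding each block. For the single-block output in the question, one of the reported hosts should be Node7 of Rack7, the node that ran the Reducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical output file written by the single Reducer.
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/output/part-00000"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the datanodes holding one of its replicas.
            for (String host : block.getHosts()) {
                System.out.println("replica on: " + host);
            }
        }
        fs.close();
    }
}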
Question :
Let's assume you have the following files in the HDFS directory called merge.
Test1.txt : hadoopexam.com Hadoop Training 1
Test2.txt : www.hadoopexam.com Hadoop YARN Training
Test3.txt : http://hadoopexam.com Amazon WebService Training
Now you run the following command: hadoop fs -getmerge merge/ output1.txt
Which statement is correct?
1. It will create a new file called output1.txt in the local file system, with the merged content from all three files
2. It will create a new file called output1.txt in the HDFS file system, with the merged content from all three files
3. Access Mostly Uused Products by 50000+ Subscribers
4. The command will succeed but will not merge the files, because what to do with the newline character is not defined
getmerge Usage: hadoop fs -getmerge <src> <localdst> [addnl]. It takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, addnl can be set to enable adding a newline character at the end of each file. The merged output is therefore written to the local file system, so option 1 is correct.
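For reference, here is a rough Java sketch of what getmerge does under the hood: read every file in the HDFS source directory and append its bytes to a single file on the local file system. The class name is made up and the paths are the ones from the question; the real shell command additionally handles file ordering and the optional addnl flag, which this sketch omits.

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GetMergeSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        // output1.txt is created on the LOCAL file system, not in HDFS.
        try (OutputStream local = new FileOutputStream("output1.txt")) {
            for (FileStatus f : hdfs.listStatus(new Path("merge/"))) {
                if (!f.isFile()) {
                    continue; // skip sub-directories
                }
                try (FSDataInputStream in = hdfs.open(f.getPath())) {
                    // Append this HDFS file's bytes to the local output file.
                    IOUtils.copyBytes(in, local, 4096, false);
                }
            }
        }
        hdfs.close();
    }
}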