Explanation: Hadoop comprises five separate daemons:
NameNode : Holds the metadata for HDFS
Secondary NameNode : Performs housekeeping functions for the NameNode; it is not a backup or hot standby for the NameNode
DataNode : Stores actual HDFS data blocks
JobTracker : Manages MapReduce jobs and distributes individual tasks to the TaskTrackers
TaskTracker : Instantiates and monitors individual Map and Reduce tasks
Refer to HadoopExam.com Recorded Training Modules 2 and 3
Question : You are using the MapR Hadoop platform to analyze web server logs you collected from Amazon Web Services. However, you are not able to decide whether Pig or MapReduce is the right tool for this analysis. Considering this scenario, which option most accurately describes the relationship between MapReduce and Pig?
1. Pig is more powerful and allows certain types of data manipulation not possible with MapReduce.
2. Pig provides the same capabilities as MapReduce; Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Pig provides the additional capability of letting you control the flow of multiple MapReduce jobs and chain MapReduce jobs together, which is not possible with MapReduce alone.
Explanation: Pig is a data-flow language; its key focus is managing the flow of data from an input source to an output store. As part of managing this data flow, it moves data by feeding it to process 1, taking the output, and feeding it to process 2. Its core features are preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data, and, most importantly, compressing and rearranging processing steps for faster processing. While this could be done for any kind of processing task, Pig is written specifically to manage the data flow of MapReduce-style jobs; most, if not all, jobs in a Pig script are MapReduce jobs or data-movement jobs. Pig also allows custom functions to be added for use in processing, alongside built-in ones such as ordering, grouping, distinct, and count, as sketched in the UDF example below.
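To illustrate the custom-function point, here is a minimal sketch of a Pig UDF written in Java against the org.apache.pig.EvalFunc API. The class name UpperCase and the assumed single-field tuple layout are hypothetical, not taken from the question.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: upper-cases the first (chararray) field of each tuple.
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

In a Pig script, such a UDF would typically be registered from its jar and then invoked like any built-in function.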
MapReduce, on the other hand, is a data-processing paradigm: it is a framework for application developers to write code in so that it scales easily to petabytes of data. This creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good number can, ranging from complex ones like k-means to simple ones like counting the unique values in a dataset; a sketch of the latter follows.
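To make the "counting uniques" example concrete, here is a minimal sketch of such a job using the standard org.apache.hadoop.mapreduce API. The class name UniqueValues and the input/output paths are illustrative only, not part of the question.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UniqueValues {

        // Mapper: emit each input line as a key; duplicates collapse during the shuffle.
        public static class UniqueMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(line, NullWritable.get());
            }
        }

        // Reducer: each distinct line arrives exactly once as a key; emit it.
        public static class UniqueReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> ignored, Context context)
                    throws IOException, InterruptedException {
                context.write(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "unique values");
            job.setJarByClass(UniqueValues.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(UniqueMapper.class);
            job.setReducerClass(UniqueReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }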
Question : You have a website called Quicktechie.com hosted on AWS (Amazon Web Services cloud). You have been collecting your logs using Apache Flume and storing them in the HDFS file system. However, while storing the logs, you do the following: 1. Remove all newline characters from each log file (the size of an individual log file is not less than 200MB). 2. Append all the log files from the same server together; while appending, you add a newline character between two log files. All the log files are in text format, and each newly created log file (after appending) is not less than 100GB in size. You selected TextInputFormat in your MapReduce job to read the log data for further processing. During log file processing, input splits represent the data in HDFS. Select the correct statement about splits and newline characters.
1. Input file splits may cross line breaks. A line that crosses file splits is ignored.
2. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Explanation: FileInputFormat works as follows: for each input file, it gets the file length and the block size, and calculates the split size as max(minSize, min(maxSize, blockSize)), where maxSize corresponds to mapred.max.split.size and minSize to mapred.min.split.size. It then divides the file into different FileSplits based on the split size calculated above. What is important here is that each FileSplit is initialized with a start parameter corresponding to the offset in the input file; there is still no handling of lines at that point. The relevant part of the code looks like this:

    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
      splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
      bytesRemaining -= splitSize;
    }

After that, if you look at the LineRecordReader defined by TextInputFormat, that is where the lines are handled. When you initialize your LineRecordReader, it tries to instantiate a LineReader, which is an abstraction for reading lines over an FSDataInputStream. There are two cases:
- If a CompressionCodec is defined, the codec is responsible for handling boundaries (probably not relevant to this question).
- If there is no codec, that is where things get interesting: if the start of your InputSplit is different from 0, you backtrack one character and then skip the first line you encounter, identified by \n or \r\n (Windows). The backtrack is important because, in case your line boundaries are the same as the split boundaries, it ensures you do not skip a valid line. Here is the relevant code:

    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int) Math.min((long) Integer.MAX_VALUE, end - start));
    }
    this.pos = start;

Since the splits are calculated in the client, the mappers do not need to run in sequence; every mapper already knows whether it needs to discard the first line or not.

So, basically, suppose you have two lines of 100MB each in the same file and, to simplify, the split size is 64MB. When the input splits are calculated, we have the following scenario:
- Split 1 contains the path and the hosts of its block, initialized at start 200-200=0MB, length 64MB.
- Split 2 is initialized at start 200-200+64=64MB, length 64MB.
- Split 3 is initialized at start 200-200+128=128MB, length 64MB.
- Split 4 is initialized at start 200-200+192=192MB, length 8MB.
- Mapper A processes split 1: start is 0, so it does not skip the first line; it reads a full line that goes beyond the 64MB limit, so it needs a remote read.
- Mapper B processes split 2: start is != 0, so it skips the first line after 64MB-1 byte, which corresponds to the end of line 1 at 100MB, still inside split 2; split 2 contains 28MB of line 2, so it remote-reads the remaining 72MB.
- Mapper C processes split 3: start is != 0, so it skips the first line after 128MB-1 byte, which corresponds to the end of line 2 at 200MB; that is the end of the file, so it does nothing.
- Mapper D is the same as mapper C, except it looks for a newline after 192MB-1 byte.
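As a quick sanity check of the arithmetic above, here is a small standalone sketch (not Hadoop code; the class and method names are hypothetical) that reproduces the split-size formula and the 200MB-file / 64MB-block walkthrough, assuming default minimum and maximum split sizes.

    public class SplitMath {
        // Mirrors the formula described above: max(minSize, min(maxSize, blockSize)).
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            long fileLength = 200 * mb;   // two 100MB lines, as in the example above
            long splitSize = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);

            long bytesRemaining = fileLength;
            int splitNo = 1;
            // Simplified split loop (ignores SPLIT_SLOP) just to print the offsets.
            while (bytesRemaining > 0) {
                long start = fileLength - bytesRemaining;
                long length = Math.min(splitSize, bytesRemaining);
                System.out.printf("Split %d: start=%dMB, length=%dMB%n",
                        splitNo++, start / mb, length / mb);
                bytesRemaining -= length;
            }
            // Prints splits starting at 0, 64, 128 and 192 MB, matching the walkthrough.
        }
    }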