Question : You have written a MapReduce job, and in the Reducer you want data to be adjusted across multiple reducers before it is written to HDFS. Is it possible for reduce tasks to communicate or talk to each other?
1. Yes, all reduce tasks can share data with the proper configuration
2. Yes, each reduce task runs independently and in isolation, but reducers can communicate with each other by creating a shared file
3. Access Mostly Uused Products by 50000+ Subscribers
4. It all depends on the size of the file created; if it is smaller than the block size then it is possible
Explanation: Picking the appropriate number of tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed. The other extreme is 1,000,000 maps / 1,000,000 reduces, where the framework runs out of resources for the overhead.

Number of Maps: The number of maps is usually driven by the number of DFS blocks in the input files, which leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been taken up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute. Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps. The number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines by splitting the input data.

Number of Reduces: The ideal number of reducers should be the optimal value that gets them closest to:
* A multiple of the block size
* A task time between 5 and 15 minutes
* Creating the fewest files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
* Terrible performance on the next phase of the workflow
* Terrible performance due to the shuffle
* Terrible overall performance because you've overloaded the NameNode with objects that are ultimately useless
* Destroying disk I/O for no really sane reason
* Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work
There are always exceptions and special cases. One particular special case is that if following this advice makes the next step in the workflow do ridiculous things, then you likely need to be an exception to the above general rules of thumb. Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces must be less than heapSize). This will be fixed at some point, but until then it provides a pretty firm upper bound. The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num). When the number of reduce tasks is set to zero, no reduce tasks are executed for that job. The intermediate data produced by the map phase is copied into HDFS as the output without modification, and the intermediate data from each mapper becomes a single output file in HDFS.
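As a rough illustration of the knobs discussed above, the sketch below shows an old-API (JobConf) driver that sets the minimum split size, hints the number of map tasks, and fixes the number of reduce tasks. The class name, paths, and chosen values are placeholders for illustration, not part of any particular job.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TaskCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TaskCountDriver.class);
        conf.setJobName("task-count-example");

        // A hint for the number of map tasks; the InputFormat has the final say.
        conf.setNumMapTasks(100);

        // Lower bound on input split size in bytes; raising it reduces the number of maps.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

        // Explicit reducer count; conf.setNumReduceTasks(0) would make the job map-only,
        // writing each mapper's output straight to HDFS.
        conf.setNumReduceTasks(10);

        // With no mapper/reducer classes set, Hadoop's identity implementations are used.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```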
When the number of reduce tasks is set to one, a single reduce task is executed for that job. That reducer processes all intermediate data produced by the map phase and produces a single output file in HDFS.

Every TaskTracker periodically sends a heartbeat message to the JobTracker. Whenever a TaskTracker sends the heartbeat message and there are tasks waiting to be assigned, the JobTracker will assign that TaskTracker a number of tasks equal to the number of the TaskTracker's free slots, up to the number of pending tasks. If the TaskTracker has fewer free slots than the number of pending tasks, the JobTracker will assign it the tasks that will have the minimum number of network hops to their data when executed by that TaskTracker. Ideally the TaskTracker will be assigned tasks whose data is local to the TaskTracker node or, as the second choice, local to another node in the same rack as the TaskTracker node ("rack local"). When, however, the TaskTracker has more free slots than the number of pending tasks, it will be assigned all pending tasks, regardless of where the data for those tasks is located. If a job is submitted that will process a single input split, that job will contain a single map task. As that single map task is the only pending task, the next TaskTracker that sends a heartbeat message will be assigned the task, regardless of data locality. For an explanation of how input splits are computed and how they relate to map tasks, see chapter 2 in Hadoop: The Definitive Guide, 3rd Edition, in the Scaling Out: Data Flow section.

The MapReduce programming model uses isolated execution of tasks with a one-time transfer of information during the shuffle and sort phase. There is no mechanism provided by MapReduce to allow tasks to communicate with each other. (It would be possible, however, to use some external mechanism, such as RMI or a message queue, to communicate between or among tasks in the same phase.) For an explanation of the task execution process, see Anatomy of a MapReduce Job Run with Hadoop.
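To make the isolation point concrete, here is a minimal reducer sketch using the same old mapred API: each reduce() invocation sees only the values the shuffle delivered for a single key, and the framework exposes no API for this task to talk to any other reduce task. The class name and key/value types are illustrative only.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // only this key's values are visible here
        }
        // Output goes to this task's own part file in HDFS; other reduce tasks
        // never see it while the job is running.
        output.collect(key, new LongWritable(sum));
    }
}
```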
Watch the module : 3 to 10 from http://hadoopexam.com/index.html/#hadoop-training
Question : All HadoopExam website subscribers' information is stored in a MySQL database. Which tool is best suited to import a portion of the subscribers' information every day as files into HDFS, and to generate Java classes to interact with that imported data?
1. Hive
2. Pig
3. Access Mostly Uused Products by 50000+ Subscribers
4. Flume
Explanation: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

What Sqoop Does: Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
* Allows data imports from external datastores and enterprise data warehouses into Hadoop
* Parallelizes data transfer for fast performance and optimal system utilization
* Copies data quickly from external systems to Hadoop
* Makes data analysis more efficient
* Mitigates excessive loads on external systems

How Sqoop Works: Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors, which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems. Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
* Imports individual tables or entire databases to files in HDFS
* Generates Java classes to allow you to interact with your imported data
* Provides the ability to import from SQL databases straight into your Hive data warehouse
After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.
The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.
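As a hedged illustration of consuming imported data, the mapper below assumes Sqoop wrote the subscribers table as comma-delimited text with a hypothetical column layout (id,name,email,signup_date); in practice you could instead use the generated Java class to parse each record. The class name, layout, and delimiter are assumptions, not Sqoop-specific API.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SubscriberMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    public void map(LongWritable offset, Text row,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        // Assumed column layout of the Sqoop-imported rows: id,name,email,signup_date
        String[] fields = row.toString().split(",");
        if (fields.length > 2) {
            // Count subscribers per email domain as a simple example aggregation.
            String emailDomain = fields[2].substring(fields[2].indexOf('@') + 1);
            output.collect(new Text(emailDomain), ONE);
        }
    }
}
```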
Watch the module : 22 from http://hadoopexam.com/index.html/#hadoop-training
Question : A client application of HadoopExam creates an HDFS file named HadoopExam.txt with a replication factor of . Which of the following best describes the file access rules in HDFS if the file has a single block that is stored on DataNodes C1, C2, C3, C4 and C5?
1. The file cannot be accessed if at least one of the DataNodes storing the block is unavailable.
2. The file can be accessed if at least one of the DataNodes storing the block is available and the client connects to that node only.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The file can be accessed if at least one of the DataNodes storing the block is available, even if the NameNode has crashed.
Explanation: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

When a file is written to HDFS it is broken up into HDFS blocks (default size 64MB); those blocks are stored on machines running the DataNode daemon. In our example, each block is stored on three different DataNodes for reliability. Each block of a file is named blk_xxxxx, where xxxxx is a unique number. (The same block on the three different DataNodes has the same name.) When a client wishes to read the file, it contacts the NameNode, which replies with a list of the DataNodes on which each of the blocks resides. The client then contacts the first DataNode containing the first block and reads the data from there, then contacts the first DataNode containing the second block, and so on. If, for any given block, the first DataNode is not available, or the copy of the block is corrupt, the client will automatically contact the second DataNode and, failing that, the third DataNode. So as long as one node containing the block is available, the file can be read. HDFS allows multiple reads of a file simultaneously; it does not allow a file to be modified once it has been written.

Replica Placement: The First Baby Steps. The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies. Large HDFS instances run on a cluster of computers that commonly spreads across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed on only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance. The current, default replica placement policy described here is a work in progress.

Replica Selection: To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
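A minimal client-side sketch of the read path described above: the application only supplies the path, while the HDFS client library asks the NameNode for block locations and streams each block from an available DataNode replica, falling back to another replica if one is unavailable. The NameNode address, class name, and file path here are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHadoopExamFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address (Hadoop 1 style key); normally taken from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoopexam/HadoopExam.txt");

        // fs.open() triggers the block-location lookup on the NameNode; the stream
        // then reads block data directly from DataNodes, replica by replica.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```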
Watch the module : 2 and 2A from http://hadoopexam.com/index.html/#hadoop-training
1. The number of reducers is defined as part of the job configuration
2. All values of the same key can be processed by multiple reducers.
3. Access Mostly Uused Products by 50000+ Subscribers
4. 1, 2 and 3 are correct
5. 1 and 3 are correct
1. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
2. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.
1. Input file splits may cross line breaks. A line that crosses file splits is ignored.
2. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
1. Increase the parameter that controls the minimum split size in the job configuration.
2. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Write a custom FileInputFormat and override the isSplitable method to always return false.