Question : On the Acmeshell.com website, all of your data is stored in an Oracle database table called MAIN.PROFILES. In HDFS you already have your Apache web server log file, users_activity.log, stored. Now you want to combine/join the users_activity.log file and the MAIN.PROFILES table. Initially, you import the table data from the database into Hive using Sqoop with the delimiter (;), keeping the column order the same. Now select the correct MapReduce code snippet that produces a CSV file, so that the output of the MapReduce job can be loaded into the Hive table created in the steps above, called PROFILE. 1. 1 2. 2 3. Access Mostly Uused Products by 50000+ Subscribers 4. 4
Explanation: String.split() returns the array of strings computed by splitting this string around matches of the given regular expression. By default, split drops all empty trailing columns, so any attempt to access the final column when the sex field is empty will result in an ArrayIndexOutOfBoundsException. Passing a negative number as the second argument to split causes it to retain the trailing empty columns.
split(regex) splits this string around matches of the given regular expression. The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input, then the resulting array has just one element, namely this string.
When producing output to be loaded directly into Hive, either the delimiters need to be set to match the delimiters used in the Hive table, or the data should be emitted in the key (using the table's field delimiters) with the value being a NullWritable instance. When the value is a NullWritable instance, the delimiter is omitted.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero, then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive, then the pattern will be applied as many times as possible and the array can have any length. If n is zero, then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
replaceAll(regex, repl) replaces each substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression
Pattern.compile(regex).matcher(str).replaceAll(repl). Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.
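As a hedged illustration of these points (not the exam's answer snippet, which is not reproduced here), the sketch below shows a map-only Mapper that parses a semicolon-delimited record with split(";", -1) so that empty trailing fields are retained, and emits the reassembled record as the key with a NullWritable value so no extra delimiter is appended and the output can be loaded into a semicolon-delimited Hive table. The class name, input layout, and delimiter choice are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch, assuming a PROFILE table whose columns are separated by ';'
// and whose column order matches the Sqoop import. Names are illustrative only.
public class ProfileCsvMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass -1 as the limit so trailing empty fields (e.g. an empty 'sex'
        // column) are retained instead of being silently dropped.
        String[] fields = value.toString().split(";", -1);

        // Re-emit the record in the key using the Hive table's field delimiter;
        // a NullWritable value means no trailing delimiter is written.
        context.write(new Text(String.join(";", fields)), NullWritable.get());
    }
}
```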
Question : As part of the HadoopExam consultancy team, you have been given a requirement by a hotel to create a GUI application in which the hotel's sales and booking staff will add and edit customer information. You don't want to spend money on an enterprise RDBMS, so you decided on a simple file as storage and considered a CSV file. Is HDFS the better choice for storing such information in the file? 1. No, because HDFS is optimized for read-once, streaming access for relatively large files. 2. No, because HDFS is optimized for write-once, streaming access for relatively large files. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Yes, because HDFS is optimized for write-once, streaming access for relatively large files.
Explanation: What HDFS Does: HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size. These specific features ensure that Hadoop clusters are highly functional and highly available:
- Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks.
- Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.
- Utilities diagnose the health of the file system and can rebalance the data across different nodes.
- Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors.
- Standby NameNode provides redundancy and supports high availability.
- Highly operable. Hadoop handles different types of clusters that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.
How HDFS Works: An HDFS cluster is composed of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas. HDFS has a client API that can be used to access data from any Java application, even those that are not MapReduce applications. HDFS is optimized, however, for streaming access to relatively large files, and HDFS does not allow arbitrary modification of a file once it has been written. (In some cases it may be possible to append to a file after it has been written.) For more information, see chapter 2 of Hadoop: The Definitive Guide, 3rd Edition, in the Design of HDFS section. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes. The NameNode actively monitors the number of replicas of a block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to: replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node.
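The explanation above notes that HDFS exposes a client API for streaming reads from any Java application. Below is a minimal sketch of that read pattern; the path /data/users_activity.log and the default-configuration assumption are illustrative, not taken from the question.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of streaming access through the HDFS client API.
public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(
                             fs.open(new Path("/data/users_activity.log")),
                             StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Process each log line as a stream; HDFS favors this
                // access pattern over random in-place modification.
                System.out.println(line);
            }
        }
    }
}
```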
Watch the module : 2 from http://hadoopexam.com/index.html/#hadoop-training
Question : Please identify the statement that correctly describes the NameNode's use of RAM. 1. To store filenames and the initial 100 lines from each file stored in HDFS. 2. To store filenames, and to work as a buffer while reading files. 3. Access Mostly Uused Products by 50000+ Subscribers 4. To store filenames and the list of blocks, but no metadata.
Explanation: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available. It is essential to look after the NameNode. Here are some recommendations from production use:
- Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size. Use ECC RAM.
- On Java 6u15 or later, run the server VM with compressed pointers (-XX:+UseCompressedOops) to cut the JVM heap size down.
- List more than one name node directory in the configuration, so that multiple copies of the file system metadata will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the metadata.
- Configure the NameNode to store one set of transaction logs on a separate disk from the image.
- Configure the NameNode to store another set of transaction logs to a network-mounted disk.
- Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
- Do not host DataNode, JobTracker or TaskTracker services on the same system.
The NameNode's role is to translate file names into block references and provide locations for blocks. To make those operations as fast as possible, a full copy of this metadata is maintained in memory. That data includes file names, block locations, and other metadata about the files.
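To make the last point concrete, the sketch below asks the NameNode (through the client API) for the block locations of a file, which it answers from its in-memory namespace image. The file path is an illustrative assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: resolve a file name into block references and DataNode hosts.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/users_activity.log");  // assumed path
            FileStatus status = fs.getFileStatus(path);

            // The NameNode serves this mapping (file -> blocks -> DataNodes)
            // from the namespace image it holds in RAM.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```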
Watch the module : 2 and 2 A from http://hadoopexam.com/index.html/#hadoop-training
1. Hadoop attempts to run Mappers on the nodes that hold their portion of the data locally. 2. Multiple Mappers run in parallel. 3. Access Mostly Uused Products by 50000+ Subscribers 4. 1 and 2 are correct 5. 1, 2 and 3 are correct