Question: What is HBase?
1. HBase is a separate set of Java APIs for the Hadoop cluster
2. HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using the Hadoop infrastructure
3. …
4. HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data processing.
Explanation: HBase is one of the Hadoop ecosystem projects that allows real-time data scans across big data volumes. It is very often used to serve data directly from a cluster.
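For concreteness, here is a minimal sketch of such a scan using the standard HBase Java client API (HBase 1.0+ style). The table name "metrics" and column family "d" are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("d")); // restrict the scan to one column family
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}

Because reads are served by region servers rather than by a batch job, a scan like this returns rows as they are found, which is what makes HBase suitable for serving data in real time.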
Question: What is the role of the namenode?
1. The namenode splits big files into smaller blocks and sends them to different datanodes
2. The namenode is responsible for assigning names to each slave node so that they can be identified by the clients
3. …
4. Both 2 and 3 are valid answers
Explanation: The namenode is the "brain" of the Hadoop cluster, responsible for managing the distribution of blocks across the system based on the replication policy. The namenode also supplies the specific addresses of the data blocks in response to client requests.
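To see this role in action, a client can ask the namenode where the blocks of a file live. A minimal sketch using the Hadoop FileSystem API; the file path is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/data.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // The namenode answers this query from its block metadata.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
    }
}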
Question: What happens if a datanode loses network connection for a few minutes?
1. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the administrator will need to manually delete the extra replicas
2. All data will be lost on that node. The administrator has to ensure proper data distribution between nodes
3. …
4. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted
Ans: 4 Exp: The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, the namenode triggers replication of the data from the existing replicas. However, if the datanode comes back up, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.
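One detail worth knowing: the namenode does not react instantly. It declares a datanode dead only after a timeout derived from two configuration properties. A sketch of that computation, assuming the commonly documented Hadoop 2.x formula and default values:

import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults: recheck interval 300000 ms, heartbeat interval 3 s.
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000);
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // Commonly documented formula for when a datanode is considered dead:
        long timeoutMs = 2 * recheckMs + 10 * 1000 * heartbeatSec;
        System.out.println("Datanode declared dead after ~" + timeoutMs / 1000 + " s"); // ~630 s
    }
}

So with default settings, re-replication starts only after roughly ten and a half minutes of silence from the datanode.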
Question: What happens if one of the datanodes has a much slower CPU? How will it affect the performance of the cluster?
1. Task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact
2. The slowest worker will significantly impact job execution time. It will slow everything down
3. …
4. It depends on the level of priority assigned to the task. All high-priority tasks are executed in parallel twice, so a slower datanode would be bypassed. If the task is not high priority, however, performance will be affected.
Ans: 1 Exp: Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset slow workers. Multiple instances of the same task are created, the job tracker takes the first result into consideration, and the remaining instances of the task are killed.
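Speculative execution can be toggled per job. A minimal sketch using the MapReduce Job API with the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Launch backup attempts for stragglers; the first attempt to finish wins.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        Job job = Job.getInstance(conf, "speculative-demo");
        System.out.println("map speculation enabled: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", false));
    }
}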
Question:
If you have a 128 MB file and the replication factor is set to 3, how many blocks will you find on the cluster corresponding to that file (assuming the default Apache Hadoop configuration)?
1. 3
2. 6
3. …
4. 12
Ans: 2 Exp: Based on the default block size of 64 MB, the file will be divided into 128 MB / 64 MB = 2 blocks. Each block is replicated according to the replication factor setting (default 3): 2 * 3 = 6.
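The same arithmetic in a few lines of Java, using ceiling division so it also covers files that are not an exact multiple of the block size:

public class BlockCount {
    public static void main(String[] args) {
        long fileSize = 128L * 1024 * 1024;  // 128 MB file
        long blockSize = 64L * 1024 * 1024;  // 64 MB default block size in older Apache Hadoop
        int replication = 3;                 // default replication factor
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceil(128/64) = 2
        System.out.println("block replicas on the cluster: " + blocks * replication); // 6
    }
}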
Question: What is the replication factor?
1. The replication factor controls how many times the namenode replicates its metadata
2. The replication factor creates multiple copies of the same file to be served to clients
3. The replication factor controls how many times each data block is replicated across datanodes
4. None of these answers are correct.
Ans: 3 Exp: Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
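The replication factor is normally set cluster-wide via the dfs.replication property, but it can also be changed per file. A minimal sketch through the FileSystem API; the path is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // default for files created by this client
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/data.txt"); // hypothetical path
        // Raise the replication of an existing file; the namenode schedules the extra copies.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}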
Question:
How does a Hadoop cluster tolerate datanode failures?
1. Failures are anticipated. When they occur, the jobs are re-executed.
2. Datanodes talk to each other and figure out what needs to be re-replicated if one of the nodes goes down
3. …
4. Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
Ans: 4 Exp: The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.
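As an illustration of the namenode's cluster view, the HDFS-specific client class can report the datanodes it currently knows about. This is a sketch against DistributedFileSystem, a lower-level API than most applications need:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The namenode is the source of truth for datanode liveness.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName());
            }
        }
    }
}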
Question:
Which of the following tools defines a SQL-like language? …
Question: As a client of HadoopExam, you are able to access the Hadoop cluster of HadoopExam Inc. Once your application validates its identity and is granted access to a file in the cluster, what is the remainder of the read path back to the client?
1. The NameNode gives the client the block IDs and a list of DataNodes on which those blocks are found, and the application reads the blocks directly from the DataNodes.
2. The NameNode maps the read request against the block locations in its stored metadata and reads those blocks from the DataNodes. The client application then reads the blocks from the NameNode.
3. …
4. The NameNode directs the client to the single DataNode closest to the client according to Hadoop's rack topology. The client application then reads the blocks from that single DataNode.
Explanation: When a client wishes to read a file from HDFS, it contacts the NameNode and requests the locations and names of the first few blocks in the file. It then directly contacts the DataNodes containing those blocks to read the data. It would be very wasteful to move blocks around the cluster based on a client's read request, so that is never done. Similarly, if all data was passed via the NameNode, the NameNode would immediately become a serious bottleneck and would slow down the cluster operation dramatically.
First, let's walk through the logic of performing an HDFS read operation. For this, we'll assume there's a file /user/esammer/foo.txt already in HDFS. In addition to using Hadoop's client library (usually a Java JAR file), each client must also have a copy of the cluster configuration data that specifies the location of the namenode. The client begins by contacting the namenode, indicating which file it would like to read. The client's identity is first validated, either by trusting the client and allowing it to specify a username or by using a strong authentication mechanism such as Kerberos, and then checked against the owner and permissions of the file. If the file exists and the user has access to it, the namenode responds to the client with the first block ID and the list of datanodes on which a copy of the block can be found, sorted by their distance to the client. Distance to the client is measured according to Hadoop's rack topology: configuration data that indicates which hosts are located in which racks. If the namenode is unavailable for some reason (because of a problem with either the namenode itself or the network, for example), clients will receive timeouts or exceptions as appropriate and will be unable to proceed. With the block IDs and datanode hostnames, the client can now contact the most appropriate datanode directly and read the block data it needs. This process repeats until all blocks in the file have been read or the client closes the file stream.
It is also possible that, while reading from a datanode, the process or the host on which it runs dies. Rather than give up, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. Another corner case: the block location information returned by the namenode can be outdated by the time the client attempts to contact a datanode, in which case either a retry will occur if there are other replicas, or the read will fail. While rare, these kinds of corner cases are part of what makes troubleshooting a large distributed system such as Hadoop so complex.
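The entire read path described above hides behind a few lines of client code. A minimal sketch that reads the walkthrough's /user/esammer/foo.txt, assuming the cluster configuration files are on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // The configuration tells the client library where the namenode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/esammer/foo.txt");
        // open() asks the namenode for block locations; the returned stream
        // then reads directly from datanodes, falling back to other replicas on failure.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}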
1. The physical RAM limit for each Map and Reduce task
2. The JVM heap size limit for each task.
3. The amount of virtual memory each task will receive.
4. 1 and 3
5. All 1, 2 and 3
1. Two active NameNodes and two Standby NameNodes
2. One active NameNode and one Standby NameNode
3. Two active NameNodes and one Standby NameNode
4. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy
Question: Table schemas in Hive are:
1. Stored as metadata on the NameNode
2. Stored along with the data in HDFS
3. Stored in the Metadata
4. Stored in ZooKeeper
5. Stored in Hive Metastore
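As an illustration, DDL issued through HiveServer2 records the table schema in the Hive Metastore while the data itself lives in HDFS. A minimal JDBC sketch; the connection URL, credentials, and table are hypothetical placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSchemaExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // hypothetical HiveServer2 endpoint
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // The schema created here is stored in the metastore, not on the NameNode.
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts BIGINT, line STRING)");
            try (ResultSet rs = stmt.executeQuery("DESCRIBE logs")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}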