Question : To process the www.HadoopExam.com MAIN.PROFILE.log file, you submit a job to a cluster running MRv1. There are 1,000 slave nodes spread across 100 racks, and you have NOT specified a rack topology script. Your job has a single Reducer, which runs on Node7 of Rack7. The output file it writes is small enough to fit in a single HDFS block. How does Hadoop handle writing the output file?
1. The first replica of the block will be stored on any node out of the 1,000 nodes.
2. The first replica of the block will be stored on Node7 of Rack7 only. The other two replicas will be stored on other nodes in any rack.
3.
4. The first replica of the block will be stored on Node7 in Rack7. The other two replicas will be stored on Node6 and Node8 in Rack7.
Explanation: How is it decided on which datanodes the replicas of a block will be kept?
There is a trade-off between reliability and read/write bandwidth here. Suppose we set the replication factor to 1. In that case every block of the data is kept exactly once, on a single machine, and if that machine goes down we cannot retrieve the data back, so we have a reliability problem. On the other hand, placing all replicas on a single node incurs the lowest write-bandwidth penalty, which is good.
Now suppose we set the replication factor to 5. In that case the data blocks remain readable even if up to four of the nodes holding replicas go down, and while reading there is a greater chance of data locality, i.e. a replica sitting close to the client, so data retrieval is fast. But the write bandwidth consumed is higher, and so is the storage overhead from the extra redundancy.
As we can see, there is a trade-off between reliability and read/write bandwidth. In roughly 95% of cases the replication factor is kept at 3, which is suitable for most production use cases.
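As a minimal sketch of how this knob is exposed, the snippet below sets the client-side default through dfs.replication and then changes the replication factor of a single file with the HDFS FileSystem API. The output path is hypothetical and assumes a reachable HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default for files created by this client; 3 is the usual production choice.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output path, for illustration only.
        Path out = new Path("/user/hadoopexam/MAIN.PROFILE.out");
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.writeUTF("reducer output");
        }

        // Replication can also be changed per file after the fact:
        // higher reliability at the cost of write bandwidth and storage.
        fs.setReplication(out, (short) 5);
    }
}
```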
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack. In other words, the block placement policy is such that the first copy of any block created by a client running on the cluster (as the Reducer is in this case) will be written on the node running that client, and the other two replicas will be stored on other nodes in the cluster. Watch the training from http://hadoopexam.com/index.html/#hadoop-training
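To see where the replicas of the Reducer's output actually landed, you can ask the NameNode for the file's block locations. A minimal sketch, assuming a hypothetical part-r-00000 output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical reducer output path, for illustration only.
        Path out = new Path("/user/hadoopexam/output/part-r-00000");
        FileStatus status = fs.getFileStatus(out);

        // For a file that fits in one block this returns a single BlockLocation,
        // listing the datanodes holding replicas. Without a topology script,
        // every topology path shows the same /default-rack.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("hosts: " + String.join(", ", block.getHosts()));
            System.out.println("racks: " + String.join(", ", block.getTopologyPaths()));
        }
    }
}
```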
Question : In which of the following scenarios should we use HBase?
1. If it requires random reads, random writes, or both
2. If it requires many thousands of operations per second on multiple TB of data
3.
4. All of the above
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data. HBase's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. If you know the access pattern in advance, you can put all the data that is used together into a single column family, so that access to it becomes faster.
Refer to HadoopExam.com Recorded Training, Module 18.
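The sketch below shows the random read/write pattern HBase is built for, using the standard HBase Java client. The profile table name and the d column family are hypothetical, and the table is assumed to already exist with that family.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("profile"))) {

            // One column family holding the data that is read together.
            byte[] family = Bytes.toBytes("d");

            // Random write: keyed by row, routed to exactly one region server.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(family, Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random realtime read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user#42")));
            System.out.println(Bytes.toString(result.getValue(family, Bytes.toBytes("city"))));
        }
    }
}
```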
Question : In which scenarios should HBase NOT be used?
Explanation: First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (secondary indexes, transactions, a rich query language, and so on), and keep in mind that ad-hoc analytical queries will typically run slower on HBase.