When an HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine whether a given key is in that store file.
A Get (or a scan of a single row) currently does a parallel N-way read of that row from all StoreFiles in a Region, which means N read requests from disk. Bloom filters provide a lightweight in-memory structure that reduces those N disk reads to only the files likely to contain that row (N-B, where B is the number of files the bloom filters rule out).
Keep in mind that HBase only has a block index per file, which is rather coarse grained and tells the reader only that a key may be in the file because it falls into the start and end key range of a block. Whether the key is actually present can only be determined by loading that block and scanning it. This also places a burden on the block cache, and you may create a lot of unnecessary churn that bloom filters would help avoid.
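For illustration, a bloom filter is enabled per column family. Below is a minimal sketch using the older (0.9x-era) Java client API; the table and family names are placeholders, and in newer HBase versions the BloomType enum and the admin/descriptor classes live in different packages:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class BloomFilterExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // ROW blooms answer "might this row key be in this store file?".
    // ROWCOL blooms also cover specific columns, at the cost of a larger filter.
    HColumnDescriptor colfam = new HColumnDescriptor("colfam1");
    colfam.setBloomFilterType(StoreFile.BloomType.ROW);

    HTableDescriptor table = new HTableDescriptor("mytable");
    table.addFamily(colfam);
    admin.createTable(table);
    admin.close();
  }
}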
Question : You want to do a full table scan on your data. You decide to disable block caching to see if this improves scan performance. Will disabling block caching improve scan performance? 1. No, disabling block caching does not improve scan performance. 2. Yes, when you disable block caching, you free up that memory for other operations. With a full table scan, you cannot take advantage of block caching anyway, because your entire table will not fit into the cache. 3. (option text not available) 4. Yes, when you disable block caching, you free up memory for the MemStore, which improves scan performance.
Explanation : You don't want to be using the block cache while doing full table scans, as its contents will be constantly churned. Because HBase reads entire blocks of data for efficient I/O usage, it retains these blocks in an in-memory cache so that subsequent reads do not need any disk operation. For a full table scan on a large data set, you will not be able to fit all of the scanned data into the block cache, so you will see better performance if you disable it. When you do a large scan, blocks are evicted and replaced constantly, so you get little or nothing from the cache while still spending memory, garbage-collection time, and CPU on maintaining it.
For large scans where it is likely that the entire scan does not fit into the block cache, you should absolutely disable caching of the blocks traversed by that scan (i.e., scan.setCacheBlocks(false)). Index blocks are not affected; they are cached regardless.
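As a concrete illustration, a full scan with block caching turned off might look like the following minimal sketch (older Java client API; the table name and caching value are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullScanExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");

    Scan scan = new Scan();
    scan.setCacheBlocks(false);  // do not put the data blocks traversed by this scan into the BlockCache
    scan.setCaching(500);        // rows fetched per RPC; unrelated to the block cache

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process each row here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}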
Servicing reads from the BlockCache is the primary mechanism through which HBase is able to serve random reads with millisecond latency. When a data block is read from HDFS, it is cached in the BlockCache. Subsequent reads of neighboring data (data from the same block) do not suffer the I/O penalty of again retrieving that data from disk. There is a single BlockCache instance in a region server, which means all data from all regions hosted by that server share the same cache pool. The BlockCache is instantiated at region server startup and is retained for the entire lifetime of the process.
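To make that read path concrete, here is a rough sketch of two point reads with the older Java client API (table name and row keys are placeholders); if the second row falls in the same block as the first, it is served from the BlockCache instead of disk:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockCacheReadExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");

    // First Get: the containing data block is read from HDFS and cached.
    Result first = table.get(new Get(Bytes.toBytes("row-0001")));

    // Second Get of a neighboring row: if it lies in the same block,
    // it is answered from the BlockCache without a disk read.
    Result second = table.get(new Get(Bytes.toBytes("row-0002")));

    System.out.println(first.isEmpty() + " " + second.isEmpty());
    table.close();
  }
}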
It's generally bad to use block caching when the working set size (WSS) doesn't fit in memory. This is the case when, for example, you have 40GB available across all your region servers' block caches but you need to process 1TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:
Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time, such that the chance of hitting a cached block is close to 0. Setting block caching on such a table is a waste of memory and CPU cycles, all the more so because it will generate more garbage for the JVM to pick up.
Mapping a table: In a typical MapReduce job that takes a table as input, every row is read only once, so there is no need to put its blocks into the block cache. The Scan object can turn this off via the setCacheBlocks method (set it to false); note that setCaching is a different knob that controls how many rows are fetched per RPC. You can still keep block caching turned on for such a table if you need fast random read access elsewhere. An example is counting the number of rows in a table that serves live traffic: caching every block of that table would create massive churn and would surely evict data that is currently in use.
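As a sketch of that last use case, here is a rough MapReduce row counter over a table with block caching disabled on its Scan (the older mapreduce-package HBase API is assumed; the table name, counter names, and caching value are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountJob {

  // Trivial mapper: bump a counter for every row, emit nothing.
  public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      context.getCounter("rowcount", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "rowcount-mytable");
    job.setJarByClass(RowCountJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false);   // every row is read once; caching its blocks would only cause churn
    scan.setCaching(500);         // rows per RPC, a separate knob from block caching

    TableMapReduceUtil.initTableMapperJob("mytable", scan, RowCountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}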
BLOCK CACHE
The block cache acts as a cache to keep data in memory; blocks themselves are stored on disk. In my experiment, I started with an empty block cache and ran my program twice. Both times the result was the same: the cache became full and it ran into problems with the garbage collector. I increased the block size from 1,048,576 to 67,108,864. This was done to check the increase in block cache size when new data is retrieved from disk, as well as to check the garbage collector's behavior. Again the problem was the same; the garbage collector ended up being a bottleneck. The problem lies in the CPU utilization and the garbage collector call time. If the garbage collector kicks in when the CPU is running hot, we run short of resources. Ultimately this can lead to a shutdown of your region server (which has a serious impact on performance), so it's important to maintain a "cool environment" around these two parameters. After a long thought process, I decided to disable the block cache, thus avoiding a lot of calls to the garbage collector. Obviously, this has the side effect of reading all data from disk, so I needed to do a couple more things to get where I wanted.
The block cache improves performance when the same data is read frequently. It doesn't help with large table scans.
Often, workloads don't benefit from putting data into a read cache. For instance, if a certain table or column family in a table is only accessed for sequential scans, or isn't accessed much and you don't care whether Gets or Scans take a little longer, you can choose to turn off caching for those column families. If you're doing lots of sequential scans, you're churning your cache a lot and possibly polluting it for data that you would benefit from having in the cache. By disabling the cache, you not only prevent that from happening but also make more cache available for other tables and for other column families in the same table. By default, the block cache is enabled. You can disable it at the time of table creation or by altering the table: hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE => 'false'}
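The same setting can also be changed programmatically. A rough sketch with the older Java admin API follows (table and family names are placeholders; the table is disabled while its schema is altered, and note that modifyColumn replaces the whole family descriptor, so any non-default family settings would need to be carried over):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DisableBlockCacheExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // New descriptor for the family with data-block caching turned off.
    HColumnDescriptor colfam = new HColumnDescriptor("colfam1");
    colfam.setBlockCacheEnabled(false);

    admin.disableTable("mytable");
    admin.modifyColumn("mytable", colfam);
    admin.enableTable("mytable");
    admin.close();
  }
}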
Question : Your organization has an HBase cluster with half the nodes in Geneva and half the nodes in Nevada. Which of the following is true? 1. There must be two NameNodes, one for Geneva and another for Nevada 2. As they are very far from each other, avoid replication of the data and set the replication factor to 1
Explanation: The replication feature of Apache HBase provides a way to copy data between HBase deployments. It can serve as a disaster recovery solution and can contribute to higher availability.

HBase Replication
HBase replication provides a means of copying the data from one HBase cluster to another (typically distant) HBase cluster. It is designed for data recovery rather than failover. The cluster receiving the data from user applications is called the master cluster, and the cluster receiving the replicated data from the master is called the slave cluster.

Types of Replication
You can implement any of the following replication models: master-slave replication, master-master replication, and cyclic replication. In all cases, the principle of replication is similar to that of MySQL master/slave replication, in which each transaction on the master cluster is replayed on the slave cluster. In the case of HBase, the Write-Ahead Log (WAL, or HLog) records all the transactions (Put/Delete), and the master cluster's Region Servers ship the edits to the slave cluster's Region Servers. This is done asynchronously, so having the slave cluster in a distant data center does not cause high latency at the master cluster.

Master-Slave Replication
This is the basic replication model, in which transactions on the master cluster are replayed on the slave cluster, as described above. For instructions on configuring master-slave replication, see Deploying HBase Replication.

Master-Master Replication
In this case, the slave cluster in one relationship can act as the master in a second relationship, and the slave in the second relationship can act as master in a third relationship, and so on.

Cyclic Replication
In the cyclic replication model, the slave cluster acts as master cluster for the original master. This sort of replication is useful when both clusters are receiving data from different sources and you want each of these clusters to have the same data. Important: the normal configuration for cyclic replication is two clusters; you can configure more, but in that case loop detection is not guaranteed.

Points to Note about Replication
You make the configuration changes on the master cluster side; in the case of master-master replication, you make the changes on both sides. Replication works at the table-column-family level, and the family must exist on all the slaves (you can have additional, non-replicating families on both sides). The timestamps of the replicated HLog entries are kept intact; in case of a collision (two entries identical as to row key, column family, column qualifier, and timestamp), the entry that arrives later is the one whose write will be read. Increment Column Values (ICVs) are treated as simple puts when they are replicated; in the master-master case this may be undesirable, creating identical counters that overwrite one another (see https://issues.apache.org/jira/browse/HBase-2804). Make sure the master and slave clusters are time-synchronized with each other; Cloudera recommends you use the Network Time Protocol (NTP).

Requirements
Before configuring replication, make sure your environment meets the following requirements: You must manage ZooKeeper yourself; it must not be managed by HBase, and it must be available throughout the deployment. Each host in both clusters must be able to reach every other host, including those in the ZooKeeper cluster. Both clusters should have the same HBase and Hadoop major revision.
For example, having 0.90.1 on the master and 0.90.0 on the slave is supported, but 0.90.1 on one cluster and 0.89.20100725 on the other is not. Every table that contains families that are scoped for replication must exist on each cluster and have exactly the same name. HBase version 0.92 or greater is required for multiple slaves, master-master, or cyclic replication. This version ships with CDH4.0.0.
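Since replication works at the column-family level, a family is marked for replication by setting its replication scope to 1 (the shell equivalent is the REPLICATION_SCOPE attribute). A rough sketch with the older Java admin API, with placeholder table and family names:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class EnableReplicationScopeExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // Scope 1 (REPLICATION_SCOPE_GLOBAL) ships edits to this family to the slave cluster;
    // scope 0, the default, does not. The family must also exist on the slave cluster.
    HColumnDescriptor colfam = new HColumnDescriptor("colfam1");
    colfam.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);

    admin.disableTable("mytable");
    admin.modifyColumn("mytable", colfam);
    admin.enableTable("mytable");
    admin.close();
  }
}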
1. Modify the client application to write to both the old table and a new table while migrating the old data separately 2. Use the ALTER table command to modify the rowkeys 3. (option text not available) 4. Add a new column to store the userid
1. The timestamp for the entire row must be updated to 13353903160532 2. The Put class allows setting a cell-specific timestamp 3. (option text not available) 4. The HTable class allows you to temporarily roll back the newer versions of the cell
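On option 2 above: the Put class does accept an explicit per-cell timestamp. A minimal sketch with the older Java client API (table, family, qualifier, row key, and timestamp value are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutWithTimestampExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");

    long explicitTs = 1234567890123L;  // placeholder timestamp in milliseconds
    Put put = new Put(Bytes.toBytes("row-1"));
    // family, qualifier, timestamp, value: the cell is written with the supplied version
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), explicitTs, Bytes.toBytes("value"));

    table.put(put);
    table.close();
  }
}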