Question: How does scan caching help in HBase?
Answer: If HBase is used as an input source for a MapReduce job, make sure that the input Scan instance passed to the job has setCaching set to something greater than the default (which is 1). Using the default value means that each map task will call back to the RegionServer for every single record it processes. Setting this value to 500, for example, transfers 500 rows at a time to the client for processing. There is a cost/benefit tradeoff in making the cache value large: it costs more memory on both the client and the RegionServer, so bigger isn't always better.
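As a concrete illustration, here is a minimal sketch of setting scan caching when wiring up a table-input MapReduce job. The table name and mapper class are hypothetical; Scan.setCaching and TableMapReduceUtil.initTableMapperJob are standard HBase client APIs:

  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

  Scan scan = new Scan();
  scan.setCaching(500);        // fetch 500 rows per RPC instead of the default of 1
  scan.setCacheBlocks(false);  // full scans shouldn't churn the RegionServer block cache

  TableMapReduceUtil.initTableMapperJob(
      "mytable",                      // hypothetical input table
      scan,                           // the configured Scan instance
      MyMapper.class,                 // hypothetical mapper class
      ImmutableBytesWritable.class,   // mapper output key type
      Result.class,                   // mapper output value type
      job);                           // the MapReduce Job being set up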
Question: What is the impact of scan caching in MapReduce jobs?
Answer: Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException) can occur in map tasks if processing a batch of records takes longer than the scanner lease allows before the client goes back to the RegionServer for the next set of data. This problem arises when there is non-trivial processing per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, or writes), set caching lower. Timeouts can also happen in a non-MapReduce use case (i.e., a single-threaded HBase client doing a Scan), but the heavier processing typically performed in MapReduce jobs tends to exacerbate this issue.
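Besides lowering the caching value, the client-side scanner timeout can be raised as a complementary mitigation. A minimal sketch, assuming the hbase.client.scanner.timeout.period property (older releases named this hbase.regionserver.lease.period) and a two-minute timeout chosen purely for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;

  Configuration conf = HBaseConfiguration.create();
  // Give slow per-row processing more time before the scanner lease expires
  // (value in milliseconds; 120000 = 2 minutes, an assumed example value).
  conf.setInt("hbase.client.scanner.timeout.period", 120000);

  Scan scan = new Scan();
  scan.setCaching(100);  // smaller batches so each one finishes within the lease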
Question: What is the impact of turning off the WAL on Puts?
Answer: A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the RegionServer will not write the Put to the Write Ahead Log, only to the memstore. However, the consequence is that if a RegionServer fails, any unflushed data is lost. If writeToWAL(false) is used, do so with extreme caution. In practice you may find it makes little difference if your load is well distributed across the cluster. In general, it is best to keep the WAL on for Puts and, where loading throughput is a concern, to use bulk loading techniques instead.
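A minimal sketch of a Put that skips the WAL; the row key, column names, and the already-open 'table' handle are assumptions, and note that newer client versions deprecate writeToWAL in favor of setDurability:

  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  Put put = new Put(Bytes.toBytes("row1"));  // hypothetical row key
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
  put.setWriteToWAL(false);                  // skip the Write Ahead Log (older API)
  // Equivalent call on newer clients:
  // put.setDurability(Durability.SKIP_WAL);
  table.put(put);                            // 'table' is an already-open HTable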
Question: Why pre-create regions?
Answer: Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative here, because too many regions can actually degrade performance. There are two approaches to pre-creating splits. The first is to rely on the default HBaseAdmin strategy (which is implemented in Bytes.split):

  byte[] startKey = ...;        // your lowest key
  byte[] endKey = ...;          // your highest key
  int numberOfRegions = ...;    // # of regions to create
  admin.createTable(table, startKey, endKey, numberOfRegions);

The other approach is to define the splits yourself:

  byte[][] splits = ...;        // create your own splits
  admin.createTable(table, splits);
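To make the two approaches concrete, here is a hedged sketch using hypothetical table names, a hypothetical column family, and an assumed key range; the createTable overloads shown are standard (older-style) HBaseAdmin calls:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

  // First approach: let HBase compute ten evenly spaced split points
  // between the lowest and highest expected keys (both assumed here).
  HTableDescriptor table1 = new HTableDescriptor(TableName.valueOf("mytable1")); // hypothetical
  table1.addFamily(new HColumnDescriptor("cf"));                                 // hypothetical family
  admin.createTable(table1,
      Bytes.toBytes("0000000000"),  // assumed lowest key
      Bytes.toBytes("9999999999"),  // assumed highest key
      10);                          // number of regions to create

  // Second approach: supply the split keys explicitly (shown on a second table).
  HTableDescriptor table2 = new HTableDescriptor(TableName.valueOf("mytable2")); // hypothetical
  table2.addFamily(new HColumnDescriptor("cf"));
  byte[][] splits = new byte[][] {
      Bytes.toBytes("3333333333"),
      Bytes.toBytes("6666666666"),
  };
  admin.createTable(table2, splits);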