Question: How does scan caching help in HBase?
Answer: If HBase is used as an input source for a MapReduce job, make sure that the input Scan instance passed to the job has setCaching set to something greater than the default (which is 1). Using the default value means that each map task will call back to the RegionServer for every single record it processes. Setting this value to 500, for example, transfers 500 rows at a time to the client for processing. There is a cost/benefit tradeoff in making the cache value large: it costs more memory on both the client and the RegionServer, so bigger isn't always better.
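As a concrete illustration, here is a minimal sketch of setting scan caching when wiring up a table-input MapReduce job. The table name and mapper class are hypothetical; Scan.setCaching and TableMapReduceUtil.initTableMapperJob are standard HBase client APIs:

  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

  Scan scan = new Scan();
  scan.setCaching(500);        // fetch 500 rows per RPC instead of the default of 1
  scan.setCacheBlocks(false);  // full scans shouldn't churn the RegionServer block cache

  TableMapReduceUtil.initTableMapperJob(
      "mytable",                      // hypothetical input table
      scan,                           // the configured Scan instance
      MyMapper.class,                 // hypothetical mapper class
      ImmutableBytesWritable.class,   // mapper output key type
      Result.class,                   // mapper output value type
      job);                           // the MapReduce Job being set up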
Question: What is the impact of scan caching in MapReduce jobs?
Answer: Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException) can occur in map tasks if processing a batch of records takes longer than the scanner lease allows before the client goes back to the RegionServer for the next set of data. This problem arises when there is non-trivial processing per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, or writes), set caching lower. Timeouts can also happen in a non-MapReduce use case (i.e., a single-threaded HBase client doing a Scan), but the heavier processing typically performed in MapReduce jobs tends to exacerbate this issue.
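Besides lowering the caching value, the client-side scanner timeout can be raised as a complementary mitigation. A minimal sketch, assuming the hbase.client.scanner.timeout.period property (older releases named this hbase.regionserver.lease.period) and a two-minute timeout chosen purely for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;

  Configuration conf = HBaseConfiguration.create();
  // Give slow per-row processing more time before the scanner lease expires
  // (value in milliseconds; 120000 = 2 minutes, an assumed example value).
  conf.setInt("hbase.client.scanner.timeout.period", 120000);

  Scan scan = new Scan();
  scan.setCaching(100);  // smaller batches so each one finishes within the lease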
Question: What is the impact of turning off the WAL on Puts?
Answer: A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the RegionServer will not write the Put to the Write Ahead Log, only to the memstore. However, the consequence is that if a RegionServer fails, any unflushed data is lost. If writeToWAL(false) is used, do so with extreme caution. In practice you may find it makes little difference if your load is well distributed across the cluster. In general, it is best to keep the WAL on for Puts and, where loading throughput is a concern, to use bulk loading techniques instead.
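A minimal sketch of a Put that skips the WAL; the row key, column names, and the already-open 'table' handle are assumptions, and note that newer client versions deprecate writeToWAL in favor of setDurability:

  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  Put put = new Put(Bytes.toBytes("row1"));  // hypothetical row key
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
  put.setWriteToWAL(false);                  // skip the Write Ahead Log (older API)
  // Equivalent call on newer clients:
  // put.setDurability(Durability.SKIP_WAL);
  table.put(put);                            // 'table' is an already-open HTable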
Question: Why pre-create regions?
Answer: Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative here, because too many regions can actually degrade performance. There are two approaches to pre-creating splits. The first is to rely on the default HBaseAdmin strategy (which is implemented in Bytes.split):

  byte[] startKey = ...;        // your lowest key
  byte[] endKey = ...;          // your highest key
  int numberOfRegions = ...;    // # of regions to create
  admin.createTable(table, startKey, endKey, numberOfRegions);

The other approach is to define the splits yourself:

  byte[][] splits = ...;        // create your own splits
  admin.createTable(table, splits);
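To make the two approaches concrete, here is a hedged sketch using hypothetical table names, a hypothetical column family, and an assumed key range; the createTable overloads shown are standard (older-style) HBaseAdmin calls:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

  // First approach: let HBase compute ten evenly spaced split points
  // between the lowest and highest expected keys (both assumed here).
  HTableDescriptor table1 = new HTableDescriptor(TableName.valueOf("mytable1")); // hypothetical
  table1.addFamily(new HColumnDescriptor("cf"));                                 // hypothetical family
  admin.createTable(table1,
      Bytes.toBytes("0000000000"),  // assumed lowest key
      Bytes.toBytes("9999999999"),  // assumed highest key
      10);                          // number of regions to create

  // Second approach: supply the split keys explicitly (shown on a second table).
  HTableDescriptor table2 = new HTableDescriptor(TableName.valueOf("mytable2")); // hypothetical
  table2.addFamily(new HColumnDescriptor("cf"));
  byte[][] splits = new byte[][] {
      Bytes.toBytes("3333333333"),
      Bytes.toBytes("6666666666"),
  };
  admin.createTable(table2, splits);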