Question: What is the use of "HColumnDescriptor"?
Answer: An HColumnDescriptor contains information about a column family, such as the number of versions to keep, compression settings, and so on. It is used as input when creating a table or adding a column family. Once set, the parameters that describe a column family cannot be changed without deleting the family and recreating it, and any data stored in the family is deleted along with it.
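A minimal sketch of how an HColumnDescriptor is used when creating a table, based on the classic (pre-2.0) HBase client API; the table name "mytable", the family name "data", and the chosen settings are assumptions for this example:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.io.compress.Compression;

    public class CreateTableExample {
        public static void main(String[] args) throws IOException {
            try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = connection.getAdmin()) {
                // Describe the column family: keep three versions, compress with GZ.
                HColumnDescriptor family = new HColumnDescriptor("data");
                family.setMaxVersions(3);
                family.setCompressionType(Compression.Algorithm.GZ);
                // The descriptor is passed in as part of the table definition.
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
                table.addFamily(family);
                admin.createTable(table);
            }
        }
    }

GZ is used here because it works without extra native libraries; SNAPPY or LZO would require them to be installed on the cluster.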
Question: In HBase, what is the problem with "Time Series Data", and can you explain hotspotting?
Answer: When dealing with stream processing of events, the most common use case is time series data. Such data could be coming from a sensor in a power grid, a stock exchange, or a monitoring system for computer systems. Its salient feature is that its row key represents the event time. This clashes with the way HBase arranges its rows: they are all stored sorted, in distinct ranges, namely regions with specific start and stop keys. The sequential, monotonically increasing nature of time series data means all incoming data is written to the same region, and since this region is hosted by a single server, all the updates tax this one machine. This can cause regions to run hot under the load of accesses and, in the process, slow down the perceived overall performance of the cluster, because inserting data is now bound to the performance of a single machine.
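To make the problem concrete, a naive time series row key is often built directly from the event time, as in this small sketch (the class and method names are assumptions for illustration):

    import org.apache.hadoop.hbase.util.Bytes;

    public class NaiveTimeSeriesKey {
        // The row key is just the event timestamp: consecutive events get
        // adjacent keys, so every write lands in the region covering the
        // newest key range, i.e. on a single region server.
        static byte[] rowKey(long eventTime) {
            return Bytes.toBytes(eventTime);
        }
    }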
Question: What is salting, and how does it help with the "Time Series Hotspot" problem?
Answer: You can overcome this problem by making sure data is spread over all region servers instead, for example by prefixing the row key with a nonsequential prefix. Common choices include:

Salting: You can use a salting prefix on the key that guarantees a spread of all rows across all region servers. For example:

    byte prefix = (byte) (Long.hashCode(timestamp) % <number of region servers>);
    byte[] rowkey = Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(timestamp));

This formula generates enough prefix values to ensure that rows are sent to all region servers. Of course, the formula assumes a specific number of servers, and if you are planning to grow your cluster you should set this number to a multiple instead. The generated row keys might look like this:

    0myrowkey-1, 1myrowkey-2, 2myrowkey-3, 0myrowkey-4, 1myrowkey-5, 2myrowkey-6, ...

When these keys are sorted and sent to the various regions, the order would be:

    0myrowkey-1
    0myrowkey-4
    1myrowkey-2
    1myrowkey-5
    ...

In other words, the updates for row keys 0myrowkey-1 and 0myrowkey-4 are sent to one region (assuming they do not span two regions, in which case there would be an even broader spread), and 1myrowkey-2 and 1myrowkey-5 to another. The drawback of this approach is that access to a range of rows must be fanned out in your own code and performed with multiple get or scan calls. On the upside, you can use multiple threads to read this data from distinct servers, thereby parallelizing read access. This is akin to a small map-only MapReduce job, and should result in increased I/O performance.
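Putting the salting idea together, here is a small self-contained sketch (the bucket count of 3 and the class/method names are assumptions for this example; Math.floorMod is used instead of % so a negative hash value cannot yield a negative bucket):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
        // Assumed to match the number of region servers; use a multiple of
        // it if the cluster is expected to grow.
        static final int BUCKETS = 3;

        // Prefix the timestamp with a one-byte bucket id so that consecutive
        // timestamps are routed to different regions.
        static byte[] saltedRowKey(long timestamp) {
            byte prefix = (byte) Math.floorMod(Long.hashCode(timestamp), BUCKETS);
            return Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(timestamp));
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            for (int i = 0; i < 6; i++) {
                byte[] key = saltedRowKey(now + i);
                System.out.println("bucket=" + key[0]
                    + " key=" + Bytes.toStringBinary(key));
            }
        }
    }

A range read then has to issue one scan per bucket (0 to BUCKETS-1), each covering the same time range under its own prefix; running those scans in separate threads gives the parallel, map-only style of read access described above.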
Question: What is "Field swap/promotion"?
Answer: You can move the timestamp field of the row key, or prefix it with another field. This approach uses the composite row key concept to move the sequential, monotonically increasing timestamp to a secondary position in the row key. If you already have a row key with more than one field, you can swap them; if the timestamp is the only field in the current row key, you need to promote another field, from the column keys or even the value, into the row key. There is a drawback to moving the time to the right-hand side of the composite key: you can only access data, especially time ranges, for a given swapped or promoted field.
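As a sketch of the promotion variant (the field names and widths are assumptions for illustration), the row key leads with a fixed-length metric id and the timestamp moves to the second position, so a time-range read for one metric becomes a single contiguous scan:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PromotedFieldKey {
        // Composite key layout: <metricId (4 bytes)><timestamp (8 bytes)>.
        static byte[] rowKey(int metricId, long timestamp) {
            return Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(timestamp));
        }

        // A time-range scan is only possible per metric id: it covers one
        // contiguous key range under that id's prefix. Scanning a time range
        // across all metrics would require one scan per id.
        static Scan timeRangeScan(int metricId, long start, long end) {
            return new Scan(rowKey(metricId, start), rowKey(metricId, end));
        }
    }

This also makes the stated drawback concrete: without a given metric id there is no single contiguous range to read, so a query across all metrics for a time window must fan out over every id.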