Question: As an HBase administrator at Acmeshell.com, you have configured HBase to store a maximum number of versions. You have inserted 7 versions of your data in a column family called Acmeshell. At what point are the older versions removed from Acmeshell?
1. Never; the older versions have to be deleted manually.
2. The older versions are removed at major compaction.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The older versions are removed at minor compaction.
Correct Answer: Explanation: Deletion in HBase: when a Delete command is issued through the HBase client, no data is actually deleted. Instead, a tombstone marker is set, making the deleted cells effectively invisible. User Scans and Gets automatically filter out deleted cells until they are removed; HBase periodically removes deleted cells during compactions.
The tombstone markers themselves are only deleted during major compactions (which compact all store files into a single one), because in order to prove that a tombstone marker no longer has any effect, HBase needs to look at all cells. There are three types of tombstone markers:
- version delete marker: marks a single version of a column for deletion
- column delete marker: marks all versions of a column for deletion
- family delete marker: marks all versions of all columns of a column family for deletion
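To make the three marker types concrete, here is a minimal sketch against the older HTable/Delete client API; the table, family, and column names are illustrative assumptions, and newer clients renamed these methods to addColumn/addColumns/addFamily:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteMarkers {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "articles");          // hypothetical table name
    byte[] family = Bytes.toBytes("Acmeshell");
    byte[] qualifier = Bytes.toBytes("content");          // hypothetical column

    Delete d = new Delete(Bytes.toBytes("row-1"));
    // In practice you would typically issue only one of these per Delete:
    d.deleteColumn(family, qualifier);   // version delete marker: only the latest version
    d.deleteColumns(family, qualifier);  // column delete marker: all versions of the column
    d.deleteFamily(family);              // family delete marker: all columns of the family
    table.delete(d);
    table.close();
  }
}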
It is also possible to add a maximum timestamp to column and family delete markers, in which case only versions with a lower timestamp are affected by the marker. HBase also allows you to perform time-range queries in order to see only the versions within a specified range of time. For example, to see the data "as of time T", the range would be set to [0, T+1) (T+1 because in HBase the end time is exclusive).
There is one snag with such time-range queries, though. Once a delete marker is set, all cells affected by that marker are no longer visible. If a Put for a column C was issued at time T and is followed by a column delete at time T+X, a time-range scan for [0, T+1) will return no data, because deleted cells are never shown. Write operations continue to insert new data; delete operations mark rows as deleted via tombstones, and the data is eventually removed when a major compaction runs. Major compaction removes delete records, tombstones, and old versions.
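A minimal sketch of such an "as of time T" read with the Scan API (the table name and the value of T are illustrative); note the upper bound of T + 1 because the end time is exclusive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class AsOfTimeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "articles");   // hypothetical table name
    long t = 1398124800000L;                        // hypothetical "as of" timestamp T

    Scan scan = new Scan();
    scan.setTimeRange(0, t + 1);                    // [0, T+1): end time is exclusive
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // Cells covered by a delete marker are filtered out here,
      // even if the marker's timestamp lies outside the requested range.
      System.out.println(r);
    }
    scanner.close();
    table.close();
  }
}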
HBase writes out immutable files as data is added and accumulates more files as time passes; eventually, your read operations get slower. HBase compaction rewrites several files into one in order to make reads faster. Major compactions rewrite all files within a column family for a region into a single new one and remove the older versions. Minor compactions rewrite the last few files into one larger one, and not all of the older versions will be removed. HBASE-4536 addresses the visibility issue described above. It is now possible to instruct a column family to retain deleted cells and treat them exactly like ordinary, undeleted cells (which means they still contribute to version counts and can expire if a TTL was set for the column family). This can be done in the Java client by calling HColumnDescriptor.setKeepDeletedCells(true), or through the HBase shell by setting KEEP_DELETED_CELLS => true for the column family.
When this setting is enabled for a column family, deleted cells are visible to time range scans and gets as long as the requested range does not include the delete marker.
So in the case above, a Scan or Get for [0, T+1) will return the Put that was marked as deleted, while a Scan or Get for the range [0, T+X+1) will not return the Put, as that range does include the delete marker.
This is very useful for full "as-of-time" queries, for example on backup replicas of production data when a user has accidentally deleted some data.
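As a sketch, the same setting applied from the Java client by altering the column family descriptor (the shell equivalent is an alter of the family with KEEP_DELETED_CELLS => true, as noted above). The table and family names are assumptions, and older releases may require the table to be disabled for the schema change:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class KeepDeletedCellsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Note: a freshly built descriptor resets other attributes of the
    // family to their defaults; in real use, start from the existing one.
    HColumnDescriptor cf = new HColumnDescriptor("Acmeshell");
    cf.setKeepDeletedCells(true);   // retain deleted cells; they still count against
                                    // MAX_VERSIONS and still expire via TTL
    admin.disableTable("articles"); // hypothetical table name
    admin.modifyColumn("articles", cf);
    admin.enableTable("articles");
    admin.close();
  }
}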
Question: While analyzing the entire QuickTechie.com articles backup table stored in HBase, you found that it is not performing well and is showing slowness. You considered the block size option and increased the block size from 64KB to 512KB, assuming the ARTICLE table size is 1TB. Why does increasing the block size improve scan performance?
1. When you increase the block size, HBase reduces seeks on disk, which increases scan performance.
2. Increasing the block size means fewer block indexes need to be read from disk, which increases scan performance.
3. Access Mostly Uused Products by 50000+ Subscribers
4. None of the above
Correct Answer: Explanation: Do not turn off the block cache (you would do this by setting hbase.block.cache.size to zero). Currently HBase does not do well if you do this, because the region server will spend all its time loading hfile indices over and over again. If your working set is such that the block cache does you no good, at least size the block cache so that the hfile indices stay in the cache (you can get a rough idea of the size you need by surveying the region server UIs; you'll see the index block size accounted for near the top of the web page).

The block size can be configured for each ColumnFamily in a table, and it defaults to 64KB. Larger cell values require larger block sizes. There is an inverse relationship between block size and the resulting StoreFile indexes (i.e., if the block size is doubled, the resulting indexes should be roughly halved). As HBase reads entire blocks of data for efficient I/O usage, it retains these blocks in an in-memory cache so that subsequent reads do not need any disk operation. In MapReduce, each block is assigned to a map task to process the contained data. This means larger block sizes equal fewer map tasks to run, as the number of mappers is driven by the number of blocks that need processing.

In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it is always accompanied by its row, column name, and timestamp. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such case is described by Marc Limotte at the tail of HBASE-3551 (recommended reading). Therein, the indices that are kept in HBase store files (hfiles) to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc, in the comment cited above, suggests upping the block size so that entries in the store file index happen at a larger interval, or modifying the table schema so that it makes for smaller rows and column names. Compression will also make for larger indices. See the thread "a question storefileIndexSize" on the user mailing list.
Most of the time small inefficiencies don't matter all that much, but unfortunately this is a case where they do. Whatever patterns are selected for column families, attributes, and row keys, they could be repeated several billion times in your data. An hfile is the file format that HBase uses to store data in HDFS. It contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The size of those indexes is a factor of the block size (64KB by default), the size of your keys, and the amount of data you are storing. For big data sets it's not unusual to see index sizes around 1GB per region server, although not all of it will be in cache, because the LRU will evict indexes that aren't used.
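As a rough sketch of how the 64KB-to-512KB change from the question could be applied per column family (the table and family names are assumptions; the shell equivalent sets BLOCKSIZE in an alter command):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RaiseBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HColumnDescriptor cf = new HColumnDescriptor("article");  // hypothetical family name
    cf.setBlocksize(512 * 1024);  // 512KB instead of the 64KB default -> roughly 8x fewer index entries
    admin.disableTable("ARTICLE");
    admin.modifyColumn("ARTICLE", cf);
    admin.enableTable("ARTICLE");
    admin.close();
  }
}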
Question: You have written a mobile application for an advertising company called Acmeshell.com. Your mobile application needs to retrieve non-sequential rows between 01011990 (start date) and 31121990 (end date) from a table with 1TB of mobile ads stored as rows. What does your application need to implement to retrieve data for known row keys?
1. HTable.get(List<Get> gets)
Correct Answer: Explanation: The Scan class reads entire rows, or reads data starting from a startRow parameter defining the row key where the scan begins reading from the HBase table. The optional stopRow parameter can be used to limit the scan to a specific row key where it should stop reading. Scan is best for retrieving a range of data sequentially. HTable.get(Get get) extracts specific cells from a given row. HTable.get(List<Get> gets) extracts specific cells from the given rows in a batch. HTable.get is best for retrieving non-sequential data with known row keys. In this scenario, you are retrieving 200-300 non-sequential rows, so HTable.get(List<Get> gets) is the better solution.

Class HTable: used to communicate with a single HBase table; an implementation of HTableInterface. Instances of this class can be constructed directly, but it is encouraged that users get instances via HConnection and HConnectionManager. See the HConnectionManager class comment for an example. This class is not thread safe for reads or writes.
In the case of writes (Put, Delete), the underlying write buffer can be corrupted if multiple threads contend over a single HTable instance. In the case of reads, some fields used by a Scan are shared among all threads, so concurrent reads over a single instance are not guaranteed to be safe either. Instances of HTable passed the same Configuration instance will share connections to servers out on the cluster and to the ZooKeeper ensemble, as well as caches of region locations. This is usually a *good* thing, and it is recommended to reuse the same Configuration object for all your tables; they will all share the same underlying HConnection instance. See HConnectionManager for more on how this mechanism works. HConnection reads most of the configuration it needs from the passed Configuration on initial construction. Thereafter, for settings such as hbase.client.pause, hbase.client.retries.number, and hbase.client.rpc.maxattempts, updating their values in the passed Configuration subsequent to HConnection construction will go unnoticed. To run with changed values, make a new HTable, passing a new Configuration instance that has the new configuration. Note that this class implements the Closeable interface. When an HTable instance is no longer required, it *should* be closed in order to ensure that the underlying resources are promptly released. Please note that the close method can throw java.io.IOException, which must be handled.

Class Scan: used to perform Scan operations. All operations are identical to Get, with the exception of instantiation: rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the Scanner will iterate over all rows. To scan everything for each row, instantiate a Scan object.
To modify scanner caching for just this scan, use setCaching. If caching is NOT set, the caching value of the hosting HTable is used; see HTable.setScannerCaching(int). In addition to row caching, it is possible to specify a maximum result size using setMaxResultSize(long). When both are used, single server requests are limited by either the number of rows or the maximum result size, whichever limit comes first. To further define the scope of what to get when scanning, use the additional methods outlined below. To get all columns from specific families, execute addFamily for each family to retrieve.
To get specific columns, execute addColumn for each column to retrieve. To only retrieve columns within a specific range of version timestamps, execute setTimeRange. To only retrieve columns with a specific timestamp, execute setTimestamp. To limit the number of versions of each column to be returned, execute setMaxVersions. To limit the maximum number of values returned for each call to next(), execute setBatch. To add a filter, execute setFilter. Expert: To explicitly disable server-side block caching for this scan, execute setCacheBlocks(boolean).
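For the scenario above, a minimal sketch of the batched multi-get for known row keys (the table name and row-key layout are assumptions, not part of the question):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchGetAds {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mobile_ads");   // hypothetical table name

    // Known, non-sequential row keys collected by the application,
    // e.g. ad ids for dates between 01011990 and 31121990.
    List<Get> gets = new ArrayList<Get>();
    for (String rowKey : new String[] {"ad-01011990-17", "ad-15061990-03", "ad-31121990-42"}) {
      gets.add(new Get(Bytes.toBytes(rowKey)));
    }

    // One client call; the Gets are batched per region server instead of
    // issuing a sequential scan over 1TB of data.
    Result[] results = table.get(gets);
    for (Result r : results) {
      System.out.println(r);
    }
    table.close();
  }
}

A Scan with startRow/stopRow would instead read the key range sequentially, which is exactly what the question wants to avoid for scattered, known row keys.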
1. Unlimited number of versions of a cell
2. Versions are specified using a long integer
3. By default the version is the current time
4. All of 1, 2 and 3 are correct
5. Only 2 and 3 are correct
1. The versions are sorted by their timestamp in ascending order
2. The versions are sorted by their timestamp in descending order
3. By default HBase maintains an unlimited number of versions
4. Only 1 and 3 are correct
5. Only 2 and 3 are correct
1. The cell with the largest value for the version is returned
2. The cell with the smallest value for the version is returned
3. A random value from among all the versions is returned
4. None of the above is correct