Cloudera HBase Certification Questions and Answers (Dumps and Practice Questions)

Question : You have an AcmeLog table in HBase. The row keys are numbers.
You want to retrieve all entries that have row key 100.
Which shell command should you use?

1. get 'AcmeLog', (FILTER ='100')
2. get 'AcmeLog', '100'
4. scan 'AcmeLog', '100'

Correct Answer : 2

Explanation: HBase gives you two classes to read data: Get and Scan. The Get class reads data for a single, specified row key, and the Scan class supports range scans. In the HBase Shell, a get operation acts on a single row. To get everything for a row, simply execute a get operation with the table name and the row key.

Further Reading
The HBase Shell wiki includes a section on get, which reads:
Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions. Examples:

hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
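
The difference between a get and a scan can be illustrated with a small toy model. This is only a sketch in plain Python, not the HBase client API: the `ROWS` list and the `get` and `scan` helpers are hypothetical stand-ins, and the only HBase behavior assumed is that row keys are kept in lexicographic order.

```python
from bisect import bisect_left

# Toy stand-in for a table: row keys kept in lexicographic (string) order,
# which is how HBase orders rows. Sample keys are made up for illustration.
ROWS = sorted(["99", "100", "1000", "101", "20"])

def get(rows, row_key):
    """Return the single row whose key matches exactly, or None."""
    i = bisect_left(rows, row_key)
    return rows[i] if i < len(rows) and rows[i] == row_key else None

def scan(rows, start_row, stop_row=None):
    """Return every row key in [start_row, stop_row), in sorted order."""
    i = bisect_left(rows, start_row)
    j = len(rows) if stop_row is None else bisect_left(rows, stop_row)
    return rows[i:j]

print(get(ROWS, "100"))   # 100
print(scan(ROWS, "100"))  # ['100', '1000', '101', '20', '99']
```

In this model a scan starting at '100' keeps going through every later key, while a get returns exactly one row, which is what the question asks for.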

Question : You have an AcmeUsers table in HBase and you would like to insert a row that consists
of an AcmeID, jayesh2014, and an email address, john@acmeshell.com. The table has a single column family
named Meta, and the row key will be the AcmeID. Which command helps in this case?

1. put 'AcmeUsers', 'jayesh2014', 'john@acmeshell.com'

2. put 'AcmeUsers', 'Meta:AcmeID', 'jayesh2014', 'Email', 'john@acmeshell.com'

4. put 'AcmeUsers', 'AcmeID:jayesh2014', 'Email:john@acmeshell.com'

Explanation: In the HBase Shell, you can type put commands to insert a row. put takes 'tableName', 'rowkey', 'columnFamily:columnQualifier', 'value', and optionally a timestamp.
Put a cell 'value' at the specified table/row/column and, optionally, timestamp coordinates. To put a cell value into table 't1' at row 'r1' under column 'c1' marked with the time 'ts1', do:
hbase> put 't1', 'r1', 'c1', 'value', ts1
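
The table/row/column/timestamp coordinates described above can be modeled in a few lines. This is a hedged sketch in plain Python, not the HBase API: the `put` and `get_latest` helpers and the dict layout are hypothetical, and the column qualifier `Email` is an assumption, since the question names only the column family Meta.

```python
import time

def put(table, row, column, value, ts=None):
    """Store value at (row, column, ts); column is 'family:qualifier'."""
    family = column.split(":", 1)[0]
    if family not in table["families"]:
        raise KeyError(f"unknown column family: {family}")
    # Default to the current time in milliseconds when no timestamp is given.
    ts = int(time.time() * 1000) if ts is None else ts
    table["cells"].setdefault((row, column), {})[ts] = value

def get_latest(table, row, column):
    """Return the value with the highest timestamp, or None."""
    versions = table["cells"].get((row, column), {})
    return versions[max(versions)] if versions else None

# Hypothetical AcmeUsers table with the single column family Meta.
users = {"families": {"Meta"}, "cells": {}}
put(users, "jayesh2014", "Meta:Email", "john@acmeshell.com")
print(get_latest(users, "jayesh2014", "Meta:Email"))
```

The point of the sketch is the argument order: table, then row key, then a column that starts with an existing column family, then the value.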

Question : You are storing page view data for a large number of Web sites, each of which has
many subdomains (www.acmeshell.com, archive.acmeshell.com, beta.acmeshell.com, etc.). Your reporting tool needs
to retrieve the total number of page views for a given subdomain of a Web site. Which of the following rowkeys should you use?

1. The domain name followed by the URL

2. The URL followed by the reverse domain name

4. The URL

Explanation: HBase normally splits a region in two at its midpoint when it reaches hbase.hregion.max.filesize (depending on the split policy). You can rely on automatic splitting, but you will end up with odd, lexically uneven split points because of the nature of your rowkeys (lots of "com" domains against few "org" domains). It may not be your exact case, but consider this potential issue:
Starting with an empty table with just 1 region, you insert 145M domains sequentially, starting from com.. and ending in org..
At the 80 million mark (a fictitious com.nnnn.www), the region automatically splits in two at "com.f*", resulting in two 40-million-row regions, and the job continues writing rows into region 2.
At the 120 million mark (a fictitious com.yyyy.www), the second region reaches the max filesize and splits into two 40-million-row regions at "com.p*", and the job continues writing rows into region 3.
The job ends at 145M domains; no more splits are performed.
Given this case, Regions 1 and 2 will store 40M rows each, but Region 3 will store 65M rows (it would split at 80M, but it may never reach that amount). Also, since you always write to the last region (even with batching enabled), the job is a lot slower than issuing batches of writes to multiple regions at the same time. Another problem: imagine you realize you also need to add .us domains (10M). Given this design they will go to Region 3, increasing the number of rows it hosts to 75M.
The common approach to ensure even distribution of keys among regions is to prepend to the rowkey a few chars of the md5 of the key (in this case the domain name). In HBase, the very first bytes of a row key determine the region that will host it. Prepending just a few chars of the md5 is enough to prevent most hotspotting (one region receiving too many writes) and to get good automatic splits, but it is generally recommended to pre-split tables to ensure even better splitting.
If you prepend 2 chars of the md5 to your rowkeys, you can pre-split the table with 15 split points: "10", "20", "30" ... up to "f0". That creates 16 regions, and if any of them needs to be split automatically, it is split at its midpoint. For example, when the region starting at "a0" and ending at "af" reaches hbase.hregion.max.filesize, it is split at about "a8", and each of the resulting regions stores half of the "a" bucket.
This is an example of which regions would host each row if you have 16 pre-split regions with 2-char prefixed row keys:
- Region 1 ---------
0b|com.example4.www
- Region 2 ---------
10|com.example.www
1b|org.example.www
- Region 6 ---------
56|com.example3.www
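
The prefixing and pre-splitting scheme described in this explanation can be sketched as follows. This is an illustrative model in plain Python, not HBase code: `salted_key` and `region_for` are hypothetical helper names, the `|` separator follows the example row keys, and the split points assume 16 regions split on the first hex character.

```python
import hashlib
from bisect import bisect_right

# 15 split points "10", "20", ..., "f0" give 16 regions.
SPLIT_POINTS = [f"{i:x}0" for i in range(1, 16)]

def salted_key(domain):
    """Prepend the first 2 hex chars of the md5 of the domain to the key."""
    prefix = hashlib.md5(domain.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}|{domain}"

def region_for(row_key):
    """Return the 1-based index of the pre-split region hosting row_key.

    A region holds keys in [start, end), so a key equal to a split point
    belongs to the region that starts at that split point.
    """
    return bisect_right(SPLIT_POINTS, row_key) + 1

for domain in ["com.example.www", "org.example.www", "com.example3.www"]:
    key = salted_key(domain)
    print(key, "-> region", region_for(key))
```

Because the prefix is derived from a hash of the domain, lexically adjacent domains are expected to land in unrelated regions, which is what spreads sequential writes across region servers.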
Given a lot more domains, the distribution would end up much more even, and almost all regions would store the same number of domains. In most cases 8-16 pre-split regions are more than enough, but if not, you can go for 32 or even 64 pre-split regions, up to a max of 256 (that would mean split points "01", "02", "03" ... "9f", "a0", "a1" ... up to "ff").
The data access pattern here is to retrieve the total number of page views for a given subdomain of a Web site. It is best to store each subdomain's data clustered together, because HBase is very good at scanning clustered data. If the rowkey starts with the reverse domain name, all rows for the same subdomain are stored next to each other, so you can efficiently calculate the total number of page views for each subdomain.
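
The clustering effect of a reverse-domain rowkey can be shown with a toy example. This is a sketch in plain Python under stated assumptions, not HBase code: the key layout (reversed host, then `/`, then path), the helper names, and the page view counts are all hypothetical.

```python
from bisect import bisect_left

def reverse_domain(host):
    """'www.acmeshell.com' -> 'com.acmeshell.www'"""
    return ".".join(reversed(host.split(".")))

def row_key(url):
    """Build 'reversed.host/path' from a scheme-less URL."""
    host, _, path = url.partition("/")
    return f"{reverse_domain(host)}/{path}"

def total_views(table, subdomain):
    """Sum the counts over the contiguous key range of one subdomain."""
    prefix = reverse_domain(subdomain) + "/"
    keys = sorted(table)  # HBase keeps rows sorted; the dict does not
    i = bisect_left(keys, prefix)
    total = 0
    while i < len(keys) and keys[i].startswith(prefix):
        total += table[keys[i]]
        i += 1
    return total

# Made-up page view counts keyed by reverse-domain row key.
views = {
    row_key("www.acmeshell.com/index.html"): 10,
    row_key("www.acmeshell.com/about"): 5,
    row_key("archive.acmeshell.com/2013"): 7,
    row_key("beta.acmeshell.com/"): 2,
}
print(total_views(views, "www.acmeshell.com"))  # 15
```

All keys for one subdomain share a prefix, so the total comes from one short range read instead of a pass over the whole table.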

Related Questions

Question : Which client interface to HBase provides the functionality needed to store and retrieve data
from HBase as well as delete obsolete values?

1. HMasterInterface
2. HTable
4. HTablePool

Question : Which of the following statements is/are correct?

A. All operations that mutate data are guaranteed to be atomic on a per-row basis
B. A reading client will not be affected by another client updating a particular row
C. Many clients can update the same row at the same time

1. A,C correct
2. A,B correct
4. All A,B,C are correct

Question : Select the correct statements.

A. Create HTable instances only once, usually when your application starts
B. Create a separate HTable instance for every thread you execute, or use HTablePool
C. Updates are atomic on a per-row basis

1. Only A,B correct
2. Only B,C correct
4. All A,B,C are correct

Question : You have data already stored in HDFS and are considering using HBase. Which additional feature does HBase provide on top of HDFS?

1. Random Writes
2. Batch Processing
4. Scalability

Question :

Your client application calls the following method for all puts to the single table notifications: put.setWriteToWAL(false);
One region, region1, of the notifications table is assigned to RegionServer rs1. Which of the following statements describes the result of
RegionServer rs1 crashing?

1. All data in the notifications table is lost
2. No data is lost
4. Data for your client application in the MemStores for region1 is lost.

Question :

Which of the following configuration values determines automated splitting ?

1. hbase.hregion.majorcompaction
2. hbase.hregion.flush.size
4. hbase.hregion.max.filesize