Question : You have an AcmeLog table in HBase. The row keys are numbers. You want to retrieve all entries that have row key 100. Which shell command should you use?
1. get 'AcmeLog', (FILTER ='100')
2. get 'AcmeLog', '100'
Explanation: HBase gives you two classes to read data: Get and Scan. The Get class reads data for a single specified row key, and the Scan class supports a range scan. In the HBase Shell, a get operation acts on a single row. To get everything for a row, simply execute a get operation with the table name and the row key.
Further Reading: The HBase Shell wiki includes a section on get, which reads: Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions. Examples:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
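To make the difference between a Get and a Scan concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the HBase client API: the row keys and values are hypothetical, and the only HBase behavior it mirrors is that row keys are byte strings kept in lexicographic order, that a get reads exactly one row key, and that a scan reads a key range with an inclusive start and an exclusive stop.

```python
import bisect

# Hypothetical rows; keys are byte strings, as in HBase.
rows = {b"100": "entry A", b"1000": "entry B", b"20": "entry C", b"3": "entry D"}

# HBase keeps rows sorted lexicographically by key bytes, not numerically.
sorted_keys = sorted(rows)

def get(row_key):
    """Point lookup: return the value for one exact row key, or None."""
    return rows.get(row_key)

def scan(start_row, stop_row):
    """Range read: keys with start_row <= key < stop_row, in sorted order."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

print(sorted_keys)        # numeric-looking keys sort as bytes: b"1000" comes before b"20"
print(get(b"100"))        # exactly one row
print(scan(b"1", b"2"))   # every key that starts with "1"
```

Because the question asks for the entries of one known row key, the point lookup is the matching operation; a range read would also return neighbors such as b"1000".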
Question : You have an AcmeUsers table in HBase and you would like to insert a row that consists of an AcmeID, jayesh2014, and an email address, john@acmeshell.com. The table has a single column family named Meta, and the row key will be the AcmeID. Which command helps in this case?
1. put 'AcmeUsers', 'jayesh2014', 'john@acmeshell.com'
2. put 'AcmeUsers', 'Meta:AcmeID', 'jayesh2014', 'Email', 'john@acmeshell.com'
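Explanation: In the HBase Shell, you can type put commands to insert a row. put takes 'tableName', 'rowkey', 'columnFamily:columnQualifier', 'value', and optionally a timestamp. It puts a cell 'value' at the specified table/row/column and, optionally, timestamp coordinates. To put a cell value into table 't1' at row 'r1' under column 'c1' marked with the time 'ts1', do:
hbase> put 't1', 'r1', 'c1', 'value', ts1
Neither option shown above follows this form: option 1 omits the column, and option 2 puts a column name where the row key belongs. For the AcmeUsers case, the row key is 'jayesh2014' and the email address is the value of a column in the Meta family.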
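The coordinates that put writes to can be sketched with a small in-memory model in Python. This is a toy, not the HBase API: the nested dictionary layout is an illustration only, and the qualifier name Email is an assumption borrowed from the answer options, since the question names only the Meta column family.

```python
import time

# Toy layout: row key -> column family -> qualifier -> {timestamp: value}
acme_users = {}

def put(table, row_key, column, value, timestamp=None):
    """Store one cell value at (row_key, column, timestamp).

    column is 'family:qualifier'; a bare 'family' gets an empty qualifier.
    When no timestamp is given, the current time in milliseconds is used.
    """
    family, _, qualifier = column.partition(":")
    ts = int(time.time() * 1000) if timestamp is None else timestamp
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value
    return ts

# 'Meta:Email' is a hypothetical column name; the explicit timestamp keeps the output stable.
put(acme_users, "jayesh2014", "Meta:Email", "john@acmeshell.com", timestamp=1)
print(acme_users)
# {'jayesh2014': {'Meta': {'Email': {1: 'john@acmeshell.com'}}}}
```

The argument order mirrors the shell command: table, row key, column, value, then the optional timestamp.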
Question : You are storing page view data for a large number of Web sites, each of which has many subdomains (www.acmeshell.com, archive.acmeshell.com, beta.acmeshell.com, etc.). Your reporting tool needs to retrieve the total number of page views for a given subdomain of a Web site. Which of the following rowkeys should you use?
1. The domain name followed by the URL
Explanation: HBase normally splits a region in two at its midpoint when the region reaches hbase.hregion.max.filesize (depending on the split policy). You can rely on automatic splitting, but you will end up with odd, lexically uneven split points because of the nature of your rowkeys (many "com" domains against few "org" domains). This may not be your exact case, but consider the potential issue. Starting with an empty table with just 1 region, you insert 145M domains sequentially, starting from com.. and ending in org..:

- At the 80 million mark (a fictitious com.nnnn.www), the region automatically splits in two at "com.f*", resulting in two regions of 40 million rows each, and writing continues into region 2.
- At the 120 million mark (a fictitious com.yyyy.www), the second region reaches the max filesize and splits at "com.p*" into two regions of 40 million rows each, and writing continues into region 3.
- The job ends at 145M domains; no more splits are performed.

In this case, Regions 1 and 2 store 40M rows each, but Region 3 stores 65M rows (it would split at 80M, but it may never reach that amount). Also, since you always write to the last region (even with batching enabled), the job is much slower than issuing batches of writes to multiple regions at the same time. Another problem: imagine you realize you also need to add .us domains (10M). Given this design, they all go to Region 3, increasing the number of rows it hosts to 75M.

The common approach to ensure even distribution of keys among regions is to prepend to the rowkey a few characters of the MD5 hash of the key (in this case the domain name). In HBase, the first bytes of a row key determine which region hosts it.
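The region sizes in the sequential-load scenario above can be checked with a short Python sketch. It is a simplified model under stated assumptions: sizes are counted in millions of rows rather than bytes (hbase.hregion.max.filesize is really a file-size threshold), every write lands in the last region because keys arrive in sorted order, and a full region splits exactly in half.

```python
def simulate_sequential_load(total_rows, max_region_rows):
    """Return per-region row counts after loading rows in sorted key order.

    Toy model: the last region receives every write; when it reaches
    max_region_rows it splits at its midpoint and writing continues
    in the upper half.
    """
    regions = [0]  # row count per region, in key order
    for _ in range(total_rows):
        regions[-1] += 1
        if regions[-1] == max_region_rows:
            half = max_region_rows // 2
            regions[-1:] = [half, max_region_rows - half]
    return regions

# Units are millions of rows: 145M domains, split threshold at 80M.
sizes = simulate_sequential_load(total_rows=145, max_region_rows=80)
print(sizes)  # [40, 40, 65]

# ".us" sorts after ".org", so 10M more rows also land in the last region.
sizes[-1] += 10
print(sizes)  # [40, 40, 75]
```

The uneven last region and the single write target are both consequences of the sorted insert order, which is what the hash prefix is meant to break up.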
Prepending just a few characters of the MD5 is enough to prevent most hotspotting (one region receiving too many writes) and to get good automatic splits, but it is generally recommended to pre-split tables to ensure even better splitting. If you prepend 2 characters of the MD5 to your rowkeys, you can pre-split the table with 15 split points: "10", "20", "30" ... up to "f0". That creates 16 regions, and if any of them needs to be split automatically, it is split at its midpoint. For example, when the region starting at "a0" and ending at "af" reaches hbase.hregion.max.filesize, it is split at about "a8", and each of the resulting regions stores half of the "a" bucket.

This is an example of which regions would host each row if you have 16 pre-split regions with 2-character prefixed row keys:

- Region 1 ---------
0b|com.example4.www
- Region 2 ---------
10|com.example.www
1b|org.example.www
- Region 6 ---------
56|com.example3.www

Given many more domains, the distribution would end up much more even, and almost all regions would store the same number of domains. In most cases, 8-16 pre-split regions are more than enough; if not, you can go to 32 or even 64 pre-split regions, up to a maximum of 256 (with split points "01", "02", "03" ... "9f", "a0", "a1" ... up to "ff").

The data access pattern here is to retrieve the total number of page views for a given subdomain of a Web site. It is best to store the subdomain data clustered together, because HBase is very good at scanning clustered data. If you store the data keyed by the reversed domain name, the data for the same subdomain is clustered together. You can then efficiently calculate the total number of page views across the subdomains.
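The key-design ideas above (reversed domain names, a 2-character MD5 prefix, and evenly spaced split points) can be sketched in Python. This is an illustration under assumptions, not HBase code: the function names are hypothetical, the hash is taken over the reversed domain, the "|" separator follows the example keys, and regions are numbered from 1 in key order.

```python
import bisect
import hashlib

def reversed_domain(domain):
    """'www.example.com' -> 'com.example.www', so related hosts sort together."""
    return ".".join(reversed(domain.split(".")))

def salted_key(domain, prefix_len=2):
    """Prepend the first hex characters of the MD5 of the reversed domain."""
    key = reversed_domain(domain)
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}|{key}"

def split_points(num_regions=16):
    """Evenly spaced 2-character hex split points; N regions need N - 1 points."""
    step = 256 // num_regions
    return [f"{i * step:02x}" for i in range(1, num_regions)]

def region_for(row_key, points):
    """1-based number of the pre-split region whose key range holds row_key."""
    return bisect.bisect_right(points, row_key) + 1

points = split_points(16)
print(len(points), points[0], points[-1])

# Example keys taken from the listing above.
for key in ["0b|com.example4.www", "10|com.example.www",
            "1b|org.example.www", "56|com.example3.www"]:
    print(region_for(key, points), key)

print(salted_key("www.example.com"))
```

Note the trade-off the two techniques imply: a hash prefix spreads writes across regions, while a plain reversed-domain key keeps one site's rows adjacent for scans, so which one fits depends on whether write distribution or range scanning matters more for the table.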