Cloudera HBase Certification Questions and Answers (Dumps and Practice Questions)

Question : You have the following data for a single row key (10):
10 Engineers:Name, timestamp=1331314762083, value=Stole
10 Engineers:Salary, timestamp=1331314762085, value=8000
10 Skills:Skill_1, timestamp=1331314762088, value=J2ee
10 Skills:Skill_2, timestamp=1331314762091, value=Java

How many regions will be read during a scan of the entire data?

1. One
2. Two
3. Three
4. Four

Correct Answer : 1
Explanation: This table consists of one unique row key (10) and 2 different column families (Engineers, Skills). All data for a given row in the table is managed together in a region, so a scan of this data reads a single region.
Regions : Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family. The hierarchy of objects is as follows:
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
For a description of what HBase files look like when written to HDFS, see trouble.namenode.hbase.objects.

Considerations for Number of Regions : In general, HBase is designed to run with a small (20-200) number of relatively large (5-20 GB) regions per server. The considerations for this are as follows:

Why can't I have too many regions? Typically you want to keep your region count low on HBase for numerous reasons. Usually right around 100 regions per RegionServer has yielded the best results. Here are some of the reasons for keeping the region count low:

MSLAB requires 2 MB per MemStore (that's 2 MB per family per region). 1000 regions that have 2 families each use 3.9 GB of heap, and that is before any data is stored. NB: the 2 MB value is configurable.

If you fill all the regions at roughly the same rate, global memory usage forces tiny flushes when you have too many regions, which in turn generates compactions. Rewriting the same data tens of times is the last thing you want. For example, fill 1000 regions (with one family) equally and consider a lower bound for global MemStore usage of 5 GB (the region server would have a big heap). Once usage reaches 5 GB, the server force-flushes the biggest region; at that point almost all regions hold about 5 MB of data, so it flushes that amount. After another 5 MB is inserted, it flushes another region that now holds a bit over 5 MB, and so on. This is currently the main limiting factor for the number of regions; see ops.capacity.regions.count for the detailed formula.

The master is allergic to tons of regions, and takes a lot of time assigning them and moving them around in batches. The reason is that it is heavy on ZooKeeper (ZK) usage and not very asynchronous at the moment (this could be improved, and has been improved a bunch in HBase 0.96).

In older versions of HBase (pre-v2 HFile, 0.90 and previous), tons of regions on a few RegionServers (RS) can cause the store file index to grow, increasing heap usage and potentially creating memory pressure or OOME on the RSs.

Another issue is the effect of the number of regions on MapReduce jobs; it is typical to have one mapper per HBase region. Thus, hosting only 5 regions per RS may not be enough to get a sufficient number of tasks for a MapReduce job, while 1000 regions will generate far too many tasks. See ops.capacity.regions for configuration guidelines.
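
The heap and flush figures quoted in this explanation can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the inputs are the numbers from the text (2 MB of MSLAB per MemStore, a 5 GB lower bound for global MemStore usage, regions filling at the same rate).

```python
def mslab_heap_mb(regions, families_per_region, mslab_mb_per_memstore=2):
    """Heap reserved by MSLAB before any data is stored, in MB."""
    # There is one MemStore per column family per region.
    return regions * families_per_region * mslab_mb_per_memstore


def avg_flush_size_mb(global_memstore_gb, regions):
    """Approximate data held per region when the global MemStore bound
    is reached, assuming every region fills at the same rate."""
    return global_memstore_gb * 1024 / regions


heap_mb = mslab_heap_mb(regions=1000, families_per_region=2)
print(f"MSLAB overhead: {heap_mb} MB = {heap_mb / 1024:.1f} GB")   # 4000 MB = 3.9 GB
print(f"Typical flush size: {avg_flush_size_mb(5, 1000):.2f} MB")  # 5.12 MB
```

With 1000 regions the flushes are only about 5 MB each, which is why a high region count leads to many small files and extra compaction work.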

Question : The QuickTechie.com website lets any software professional create, update, and delete articles. You decided to use HBase rather than HDFS to store these articles, and you have created an ARTICLES table in HBase to store all versions of each article. Select the Column Family attribute settings that always retain at least one version of an article but expire all other versions older than 1 month (30 days) for a given Column Family.

1. LENGTH = 30, MIN_VERSIONS = 1
2. TTL = 30, VERSIONS = 1

4. TTL = 2592000 , MIN_VERSIONS = 1

Correct Answer : 4
Explanation: ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row, even the current one. The TTL time encoded in HBase for the row is specified in UTC. Here, 30 days is 30 x 24 x 60 x 60 = 2592000 seconds, and MIN_VERSIONS = 1 keeps at least one version even after the TTL has passed. Store files which contain only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting the minimum number of versions to a value other than 0 also disables this. See HColumnDescriptor for more information.
Recent versions of HBase also support setting time to live on a per-cell basis. See HBASE-10560 for more information. Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied to all cells updated on the server by the operation. There are two notable differences between cell TTL handling and ColumnFamily TTLs: cell TTLs are expressed in units of milliseconds instead of seconds, and a cell TTL cannot extend the effective lifetime of a cell beyond a ColumnFamily-level TTL setting.
The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1 (it was 3 before HBase 0.96). This is an important parameter because HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs. Setting the number of max versions to an exceedingly high level (e.g., hundreds or more) is not recommended unless those old values are very dear to you, because this will greatly increase StoreFile size.
HBase keeps a timestamp for each cell. TTL (time-to-live) and MIN_VERSIONS together control which versions are kept after major compactions. TTL is expressed in seconds, so 5 days equals 432000 (5 x 24 x 60 x 60) seconds. MIN_VERSIONS sets the minimum number of versions to keep. For example, MIN_VERSIONS = 1 instructs HBase to keep at least one version, even if it is older than the TTL.
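
The interaction between TTL, MIN_VERSIONS and the maximum number of versions can be illustrated with a small model. This is a simplified Python sketch, not HBase code: the function names are hypothetical, timestamps are in seconds for readability (HBase cell timestamps are normally in milliseconds), and the exact boundary handling in HBase may differ.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(days):
    """Convert a retention period in days to a column family TTL in seconds."""
    return days * SECONDS_PER_DAY


def retained_versions(timestamps, now, ttl, min_versions=0, max_versions=None):
    """Return the version timestamps that survive, newest first."""
    newest_first = sorted(timestamps, reverse=True)
    if max_versions is not None:
        # Excess versions beyond the maximum are dropped first.
        newest_first = newest_first[:max_versions]
    kept = []
    for rank, ts in enumerate(newest_first):
        expired = (now - ts) > ttl
        # An expired version is still kept if it is among the newest min_versions.
        if not expired or rank < min_versions:
            kept.append(ts)
    return kept


now = ttl_seconds(100)
# Two versions of one article, written 90 and 40 days ago (both past a 30-day TTL).
versions = [now - ttl_seconds(90), now - ttl_seconds(40)]

print(ttl_seconds(30))                                                    # 2592000
print(retained_versions(versions, now, ttl_seconds(30), min_versions=1))  # newest version only
print(retained_versions(versions, now, ttl_seconds(30), min_versions=0))  # []
```

With MIN_VERSIONS = 0 every expired version disappears, while MIN_VERSIONS = 1 keeps the most recent version even though it is older than the TTL.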
Minimum Number of Versions : Like the maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M less than N). This parameter should only be set when time-to-live is enabled for a column family and must be less than the number of row versions.
By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed. ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes.
Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells. A "raw" scan option returns all deleted rows and the delete markers.
Example 18. Change the Value of KEEP_DELETED_CELLS Using HBase Shell
hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true
Example 19. Change the Value of KEEP_DELETED_CELLS Using the API
...
HColumnDescriptor.setKeepDeletedCells(true);
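
The visibility rule for deleted cells described above can be modelled in a few lines. The following Python sketch is a simplified illustration, not the HBase implementation: `cell_visible` is a hypothetical helper, the end of the time range is assumed to be exclusive, and only one delete marker is considered.

```python
def cell_visible(cell_ts, delete_ts, range_end, keep_deleted_cells):
    """Return True if a cell written at cell_ts is returned by a Get or Scan
    whose time range ends (exclusively) at range_end.

    delete_ts is the timestamp of a delete marker, or None if there is none.
    """
    if cell_ts >= range_end:
        return False  # the cell is outside the requested time range
    if delete_ts is None or cell_ts > delete_ts:
        return True   # no delete marker affects this cell
    if not keep_deleted_cells:
        return False  # the marker masks the cell whatever the time range is
    # With KEEP_DELETED_CELLS, a read that ends before the delete still sees the cell.
    return range_end <= delete_ts


# Cell written at t=100, deleted at t=200.
print(cell_visible(100, 200, range_end=150, keep_deleted_cells=False))  # False
print(cell_visible(100, 200, range_end=150, keep_deleted_cells=True))   # True
print(cell_visible(100, 200, range_end=300, keep_deleted_cells=True))   # False
```

The second call corresponds to the point-in-time query: the read looks at the table as it was before the delete was issued.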

Question : Given that the following is your entire data set:

23 column=Engineers:FirstName, timestamp=1331314762084, value=Shobhit
23 column=Engineers:Payment, timestamp=1331314762086, value=800000
23 column=TechnicalSkills:1_FirstSkill, timestamp=1331314762089, value=J2EE
23 column=TechnicalSkills:2_AnotherSkill, timestamp=1331314762092, value=Java

How many sets of physical files will be read during a scan of the entire data set immediately following a major compaction?

1. One
2. Two
3. Three
4. Four

Correct Answer : 2

This table consists of one unique row key (23) and 2 different column families (Engineers, TechnicalSkills). All data for a given row in the table is managed together in a region. Region size is configurable between 256 MB and 20 GB. In this example, the 4 cells of the single row fit easily within the smallest region size, 256 MB, so there is one region for this dataset. That region has one Store per column family, and a major compaction rewrites each Store into a single StoreFile, so the scan reads two sets of physical files.

Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family. The hierarchy of objects is as follows:
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
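
The hierarchy above gives a quick way to count the files a full scan touches. The Python sketch below is illustrative: the helper names are hypothetical, and it assumes the data has been flushed to disk and that a major compaction has left exactly one StoreFile per Store.

```python
def column_families(cells):
    """Distinct column families in (row_key, 'family:qualifier') pairs."""
    return {column.split(":", 1)[0] for _row_key, column in cells}


def store_files_after_major_compaction(cells, regions=1):
    """One Store per column family per region, one StoreFile per Store."""
    return regions * len(column_families(cells))


cells = [
    ("23", "Engineers:FirstName"),
    ("23", "Engineers:Payment"),
    ("23", "TechnicalSkills:1_FirstSkill"),
    ("23", "TechnicalSkills:2_AnotherSkill"),
]

print(sorted(column_families(cells)))             # ['Engineers', 'TechnicalSkills']
print(store_files_after_major_compaction(cells))  # 2
```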

Determining the "right" region size can be tricky, and there are a few factors to consider:

HBase scales by having regions across many servers. Thus if you have 2 regions for 16 GB of data on a 20-node cluster, your data will be concentrated on just a few machines and nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200 MB of data into HBase and then wondering why your awesome 10-node cluster isn't doing anything.
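
A rough way to see the effect is to compare the region count with the number of servers. This is a minimal Python sketch with hypothetical function names; it assumes each region is served by exactly one RegionServer and gives an upper bound on how many servers can be busy.

```python
def busy_servers(regions, servers):
    """Upper bound on servers holding data: a region lives on one server."""
    return min(regions, servers)


def idle_fraction(regions, servers):
    """Lower bound on the share of the cluster with nothing to serve."""
    return 1 - busy_servers(regions, servers) / servers


print(busy_servers(2, 20))            # 2
print(f"{idle_fraction(2, 20):.0%}")  # 90%
print(f"{idle_fraction(1, 10):.0%}")  # 90%
```

The last line matches the 200 MB case: data that fits in a single region keeps at most one node of a 10-node cluster busy.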

On the other hand, high region count has been known to make things slow. This is getting better with each release of HBase, but it is probably better to have 700 regions than 3000 for the same amount of data.

There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc, held by the RegionServer.

When starting off, it's probably best to stick to the default region-size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up).
