Question : Which client interface to HBase can you use so that you have all the functionality needed to store and retrieve data from HBase, as well as delete obsolete values?
Explanation: HTable is the class in the org.apache.hadoop.hbase.client package which provides the user with all the functionality needed to store and retrieve data from HBase, as well as delete obsolete values and so on.
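For illustration, a minimal sketch of that usage against the older (pre-1.0) client API this question refers to; the table name "employee" and the column coordinates are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "employee");   // hypothetical table name
    // store
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("john"));
    table.put(put);
    // retrieve
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    // delete obsolete values
    table.delete(new Delete(Bytes.toBytes("row1")));
    table.close();
  }
}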
Question : Which of the following statements is/are correct?
A. All operations that mutate data are guaranteed to be atomic on a per-row basis B. A reading client will not be affected by another updating a particular row C. Many clients can update the same row at the same time 1. A,C correct 2. A,B correct 3. (option text not available) 4. All A,B,C are correct
Answer with explanation:
2 is the best answer.
In HBase, operations that mutate data are guaranteed to be atomic on a per-row basis. This affects all other concurrent readers and writers of that same row.
However, having many clients update the same row at the same time should be avoided.
Question : Select the correct statement..
A. Create HTable instances only once, usually when your application starts B. Create a separate HTable instance for every thread you execute, or use HTablePool C. Updates are atomic on a per-row basis
Creating HTable instances is costly: it involves scanning the .META. table to check whether the table actually exists and whether it is enabled, as well as a few other operations that make this call quite expensive.
Hence it is recommended that you create HTable instances only once, usually one per thread, and reuse that instance for the rest of the lifetime of your client application, as the sketch below illustrates.
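A minimal sketch of the pooled alternative from option B above, assuming the pre-1.0 client API (pool size and table name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;

public class PoolSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTablePool pool = new HTablePool(conf, 10);        // at most 10 cached HTable instances
    HTableInterface table = pool.getTable("employee"); // hypothetical table name
    try {
      // ... puts/gets on this thread ...
    } finally {
      table.close(); // in 0.92+ this returns the instance to the pool rather than destroying it
    }
    pool.close();
  }
}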
1. You cannot add a new column to an existing column family. 2. You need to empty the column family, add the new column, and repopulate the data. 3. New columns can be added to an existing column family on demand. 4. None of the above are correct Ans : 3
Exp : A table's column families must be specified upfront as part of table schema definition. However, new column family members can be added on demand.
Question :
Physically, all column family members are stored separately if they have more than three columns?
1. True 2. False Ans : 2
Exp : Physically all column family members are stored together on the file system.
Question :
Which is the correct statement in the case of the HBase database? 1. In HBase, table rows are sorted 2. Columns can be added on the fly if the column family pre-exists 3. (option text not available) 4. All of the above Ans : 4
Question : Which of the following is valid syntax to represent a single column family?
Ans : 1 Exp : Row columns are grouped into column families. All column family members have a common prefix; for example, the columns employee:name, employee:salary, and employee:age are all members of the employee column family. Options 2 and 3 are not valid syntax.
Question : Initially a table comprises two regions: one starts filling first, and the second is kept in reservation.
1. True 2. False Ans : 2
Exp : Initially a table comprises a single region, but as the size of the region grows and crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size.
Question : Which one is the correct statement for HBase regions? 1. Tables are automatically partitioned horizontally by HBase into regions 2. Regions are the unit that gets distributed over an HBase cluster 3. (option text not available)
1. 1 and 2 are correct 2. 2 and 3 are correct 3. 1 and 3 are correct 4. 1, 2 and 3 are correct
Ans : 4
Question : There is an employee column family; the current value in row one is (john,$,) : (name,salary,age). Now one application is updating the salary to $5500 and at the same moment application two is fetching the salary. What value will be returned to application two?
Question : The completebulkload tool is used to import data into a running cluster.
1. True 2. False Ans : 1
Question : You have one primary HMaster and one standby. Your primary HMaster fails and your client application needs to make a metadata change. Which of the following is the effect on your client application?
1. The client will query ZooKeeper to find the location of the new HMaster and complete the metadata change. 2. The client will make the metadata change regardless of the state of the HMaster. 3. (option text not available) 4. The client application will fail with a runtime error. Ans : 1
Exp: The HBase master publishes its location to clients via ZooKeeper. This is done to support multi-master operation (failover). So if the HBase master self-discovers its location as a localhost address, then it will publish that. Region servers or clients which go to ZooKeeper for the master location will get back an address that, in that case, is only useful if they happen to be co-located with the master. Note: * HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
Question : You have an average key-value pair size of bytes. Your primary access is random reads on the table. Which of the following actions will speed up random read performance on your cluster?
Exp : Larger block size is preferred if files are primarily for sequential access. Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create
Question : The cells in a given row have versions that range from to . You execute a delete specifying the value 3000 for the version. What is the outcome?
Exp : When performing a delete operation in HBase, there are two ways to specify the versions to be deleted
Delete all versions older than a certain timestamp
Delete the version at a specific timestamp
A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete explicit versions. For the deletion of a row or all the columns within a family, it always works by deleting all cells older than a certain version.
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is "delete all cells where the version is less than or equal to this version". HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[17]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted
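As a hedged illustration, the pre-1.0 Delete API exposes both flavors described above (row, family, and column names are hypothetical; conf/table as in the first sketch):

Delete d = new Delete(Bytes.toBytes("row1"));
// tombstone all versions of the cell with a timestamp less than or equal to 3000
d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("col"), 3000L);
table.delete(d);

Delete d2 = new Delete(Bytes.toBytes("row1"));
// delete only the single version stored at exactly timestamp 3000
d2.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), 3000L);
table.delete(d2);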
Question : You have an "Employees" table in HBase. The Row Keys are the employees' IDs. You would like to retrieve all employees who have an employee ID between 'user_100' and 'user_110'. The shell command you would use to complete this is:
Exp : public Scan(byte[] startRow, byte[] stopRow) creates a Scan operation for the range of rows specified. Parameters: startRow - row to start scanner at or after (inclusive); stopRow - row to stop scanner before (exclusive). Reference: org.apache.hadoop.hbase.client, Class Scan
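A hedged Java equivalent of that range scan for 'user_100' through 'user_110', reusing the table from the first sketch (since stopRow is exclusive, a trailing zero byte is appended so that 'user_110' itself is included):

// stopRow is exclusive, so "user_110\0" makes the range include 'user_110'
Scan scan = new Scan(Bytes.toBytes("user_100"), Bytes.toBytes("user_110\0"));
ResultScanner scanner = table.getScanner(scan);
for (Result row : scanner) {
  System.out.println(Bytes.toString(row.getRow())); // each matching employee ID
}
scanner.close();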
Question : Under default settings, which feature of HBase ensures that data won't be lost in the event of a RegionServer failure?
1. All HBase activity is written to the WAL, which is stored in HDFS 2. All operations are logged on the HMaster 3. (option text not available) 4. Data is stored on the local filesystem of the RegionServer. Ans : 1 Exp : HBase data updates are stored in a place in memory called the memstore for fast writes. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
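A related, hedged sketch: the pre-1.0 client's Put.setWriteToWAL (referenced later in this document's bulk-load discussion) lets a client trade this durability for write speed; row and column names here are hypothetical:

Put p = new Put(Bytes.toBytes("row1"));
p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
p.setWriteToWAL(false); // skips the WAL: faster, but memstore edits are lost on RegionServer failure
table.put(p);           // the default is setWriteToWAL(true), which gives the durability described above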
Question : You want to store clickstream data in HBase. Your data consists of the following: the source id, the name of the cluster, the URL of the click, the timestamp for each click
Which rowkey would you use if you wanted to retrieve the source ids with a scan and sorted with the most recent first?
Ans : 3 The HBase master server creates the ZooKeeper znode /hbase. This is then used by the HBase daemons to coordinate; even the name of the active HBase master is stored here. If the HBase master dies, the backup HBase master overwrites the contents of the znode so clients and region servers know about the new master. Apart from this, region info is maintained in ZooKeeper znodes as well. * The multi-master feature introduced in 0.20.0 does not add cooperating Masters; there is still just one working Master while the other backups wait. For example, if you start 200 Masters, only 1 will be active while the others wait for it to die. The switch usually takes zookeeper.session.timeout plus a couple of seconds to occur.
you can just start up a second Master on some other machine and it automatically becomes the backup master. This functionality is built in to HBase and the Zookeeper quorum takes care of leader election. Basically the first HBase master to start up becomes the "Active" master and any other masters that you start up become "Backup". Should your active master die, ZK will detect this and make your backup master become the active one. This could take as long as the ZK session timeout to happen, plus some time for the backup master to do some housekeeping, but all the data that a Master operates on is persisted in ZK and HDFS, so this failover is seamless.
/hbase/master (zookeeper.znode.master) The "active" master will register its own address in this znode at startup, making this znode the source of truth for identifying which server is the Master.
/hbase/backup-masters (zookeeper.znode.backup.masters) Each inactive Master will register itself as backup Master by creating a sub-znode (hbase/backup-master/m1.host). This znode is mainly used to track which machines are available to replace the Master in case of failure.
Question : Data is written to the HLog in which of the following orders?
Question : Your application needs to retrieve non-sequential rows from a table with one billion rows. You know the rowkey of each of the rows you need to retrieve. What does your application need to implement?
Question : You have a table with the following rowkeys: r1, r2, r3, r10, r15, r20, r25, r30, r35 In which order will these rows be retrieved from a scan?
Exp : If you can have the table receiving rows always in decreasing order of the row keys, you then have easy access to the first and last rows. This is possible because HBase tables are always sorted by row key.
Question : You need to create a "WebLogs" table in HBase. The table will consist of a single Column Family called "Errors" and two column qualifiers, "IP" and "URL". The shell command you should use to create the table is:
Ans : 3 Columns in Apache HBase are grouped into column families.
All column members of a column family have the same prefix.
For example, the columns courses:history and courses:math are both members of the courses column family.
The colon character (:) delimits the column family from the column qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running. Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
Question : Which feature of HBase ensures predictable disk head seek performance within a RegionServer?
1. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is equal to newrow. Then return "true". 2. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is NOT equal to newrow. Then return "true". 3. (option text not available) 4. Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return "true". Ans : 4
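The options above describe HTable.checkAndPut; a minimal hedged sketch, reusing the conf/table setup from the first sketch (row, family, qualifier, and values are hypothetical):

Put newrow = new Put(Bytes.toBytes("rowkey"));
newrow.add(Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"), Bytes.toBytes("newvalue"));
boolean applied = table.checkAndPut(
    Bytes.toBytes("rowkey"), Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"),
    Bytes.toBytes("barvalue"), // expected current cell value
    newrow);                   // the put is applied atomically only if the check passes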
Question : Given that following is your entire dataset:
Exp : There are two columns families (Managers and Skills) so there will be two files. Note: * Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
* HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by changing flushing and compaction to work on a per column family basis).
* When changes are made to either Tables or ColumnFamilies (e.g., region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.
* StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. Compression happens at the block level within StoreFiles.
Question : Your client application is writing data to a Region. By default, where is the data saved first?
Exp : HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet.
Question : You want to do a full table scan on your data. You decide to disable block caching to see if this improves scan performance. Will disabling block caching improve scan performance?
1. No. Disabling block caching does not improve scan performance. 2. Yes. When you disable block caching, you free up that memory for other operations. With a full table scan, you cannot take advantage of block caching anyway because your entire table won't fit into cache. 3. (option text not available) 4. Yes. When you disable block caching, you free up memory for the MemStore, which improves scan performance Ans : 3
Exp : Disabling Blockcache: Do not turn off the block cache (you'd do it by setting hbase.block.cache.size to zero). Currently we do not do well if you do this, because the regionserver will spend all its time loading hfile indices over and over again. If your working set is such that the block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea of the size you need by surveying regionserver UIs; you'll see index block size accounted near the top of the webpage).
Question : Your client application needs to scan a region for the row key value . Given a store that contains the following list of Row Key values: 100, 101, 102, 103, 104, 105, 106, 107 A bloom filter would return which of the following?
* When a HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine if a given key is in that store file.
* Get/Scan(Row) currently does a parallel N-way get of that Row from all StoreFiles in a Region. This means that you are doing N read requests from disk. BloomFilters provide a lightweight in-memory structure to reduce those N disk reads to only the files likely to contain that Row (N-B).
* Keep in mind that HBase only has a block index per file, which is rather coarse-grained and tells the reader that a key may be in the file because it falls into a start and end key range in the block index. But whether the key is actually present can only be determined by loading that block and scanning it. This also places a burden on the block cache, and you may create a lot of unnecessary churn that the bloom filters would help avoid.
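A hedged sketch of turning on a Bloom filter for a column family (the family name is hypothetical, and the BloomType class moved packages between releases; shown here as in 0.96+):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType; // in 0.94 this was StoreFile.BloomType

HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family
cf.setBloomFilterType(BloomType.ROW);               // BloomType.ROWCOL additionally checks the column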
Question : You have Web servers producing timeseries data from Web traffic logs. You want to attain high write throughput for storing this data in an HBase table. Which of these should you choose for a row key to maximize your write throughput?
Ans : 1 Exp : In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so that they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
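One common mitigation alluded to above is salting plus a reversed timestamp; a hedged sketch (the bucket count and key layout are illustrative, not a prescribed format):

String sourceId = "web-42";                      // hypothetical source
long ts = System.currentTimeMillis();
int bucket = Math.abs(sourceId.hashCode()) % 16; // 16 salt buckets spread writes across regions
// the reversed timestamp (Long.MAX_VALUE - ts) sorts the most recent rows first within a source id
String rowkey = String.format("%02d|%s|%019d", bucket, sourceId, Long.MAX_VALUE - ts);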
Question : What is the advantage of using the bulk load API over doing individual Puts for bulk insert operations?
1. Writes bypass the HLog/MemStore, reducing load on the RegionServer. 2. Users doing bulk writes may disable writing to the WAL, which results in possible data loss. 3. (option text not available) 4. HFiles written out via the bulk load API are more space efficient than those written out of RegionServers. Ans : 1
Question : You have data already stored in HDFS and are considering using HBase. Which additional feature does HBase provide to HDFS?
Question : You need to free up disk space on your HBase cluster. You delete all versions of your data that is older than one week. You notice your delete has had minimal impact on your storage availability. This is because:
Exp : The actual deletion of the excess versions is done upon major compaction.
HBase basically never overwrites data but only appends. The data files are rewritten once in a while by a compaction process. A data file is basically a list of key-value pairs, where the key is the composite {row key, column key, time}. Each time you do a put that writes a new value for an existing cell, a new key-value pair gets appended to the store, even if you specify an existing timestamp. Doing lots of updates to the same row in a short time span will lead to a lot of key-value pairs being present in the store. Depending on the garbage collection settings (see next), these will be removed during the next compaction.
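Since excess versions are only physically dropped at major compaction, an administrator can request one explicitly; a hedged sketch using the pre-1.0 HBaseAdmin API (table name hypothetical, conf as in the earlier sketches):

import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(conf);
admin.majorCompact("employee"); // asynchronously requests a major compaction, which drops tombstoned and excess versions
admin.close();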
Question : Given the following HBase table schema: Row Key, colFam_A:a, colFam_A:b, colFamB:2, colFam_B:10 A table scan will return the column data in which of the following sorted orders:
Exp : Everything in HBase is sorted: first by row (row key), then by column family, followed by column qualifier, type, and finally timestamp (timestamps are sorted in reverse, so you see the newest records first).
Question : Your HBase cluster has hit a performance wall and doesn't seem to be getting faster as you add RegionServers. Adding an additional HMaster will:
1. Have no effect on performance. 2. Improve the performance of region writes but decrease the performance of metadata changes 3. (option text not available) 4. Make the performance problem even worse, as operations will have to be replicated to multiple masters.
Ans : 1
Exp : You can add multiple HBase master nodes; however, only one HBase master node is active at a time. The active HBase master node changes only when the current active HBase master node is shut down or fails.
HBase clusters expand by adding RegionServers that are hosted on industry-standard servers. If a cluster expands with more RegionServers, it increases both in terms of storage and processing capacity. When you want to increase HBase write throughput, moving to a bigger cluster by adding RegionServers will improve write performance.
Question : You have two tables in existing RDBMS. One contains information about the products you sell (name, size, color, etc.) The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. How would you design the schema?
Ans : 3 Exp : Access patterns are an important factor in HBase schema design. Even though the two tables in this scenario have very different data sizes and formats, it is better to store them in one table if you are accessing them together most of the time.
Column families allow for separation of data. You can store different types of data and formats in different column families. Attributes such as compression, Bloom filters, and replication are set on a per-column-family basis. In this example, it is better to store product information and product images in two different column families of one table.
HBase Documentation on Column Family: (http://hbase.apache.org/book/columnfamily.html) especially the part: "Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics."
Question : You have a "WebLog" table in HBase. The Row Keys are the IP Addresses. You want to retrieve all entries that have an IP Address of .... The shell command you would use is:
1. get 'WebLog', '75.67.21.146' 2. scan 'WebLog', '75.67.21.146' 3. (option text not available) 4. scan 'WebLog', {COLFAM => 'IP', FILTER => '75.67.12.146'} Ans : 1 Exp : HBase gives you two classes to read data: Get and Scan. The Get class reads data by specifying a single row key and the Scan class supports a range scan. In the HBase Shell, a get operation performs the action on a single row. To get everything for a row, simply execute a get operation with the row to get.
Further Reading: The HBase Shell wiki includes a section on scan, which includes: Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions. Examples:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
Question : For a given Column Family, you want to always retain at least one version, but expire all other versions that are older than 5 days. Which of the following Column Family attribute settings would you set to do this?
Exp : * Time To Live (TTL) ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.
5 days is 432,000 (5 x 24 x 60 x 60) seconds
* Minimum Number of Versions Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M is less than N). This parameter should only be set when time-to-live is enabled for a column family and must be less than the number of row versions.
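Taken together, a hedged sketch of the column family settings this question describes (the family name is hypothetical; setTimeToLive and setMinVersions are the HColumnDescriptor setters referenced above):

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setTimeToLive(5 * 24 * 60 * 60); // 432,000 seconds = 5 days
cf.setMinVersions(1);               // always retain at least one version, even past the TTL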
Question : Given that the following is your entire dataset:
How many regions will be read during a scan of the entire dataset?
1. 4 2. 2 3. 1 4. 3 Ans : 3 This table consists of one unique rowkey (100) and 2 different column families (Managers, Skills). All data for a given row in the table is managed together in a region. Region size is configurable between 256MB and 20GB. In this example, 4 rows can fit within the lowest default region size, 256MB; therefore there is one region for this dataset.
Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family. The hierarchy of objects is as follows:
Table (HBase table)
  Region (Regions for the table)
    Store (Store per ColumnFamily for each Region for the table)
      MemStore (MemStore for each Store for each Region for the table)
      StoreFile (StoreFiles for each Store for each Region for the table)
        Block (Blocks within a StoreFile within a Store for each Region for the table)
Region Size: Determining the "right" region size can be tricky, and there are a few factors to consider:
HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB of data on a 20-node cluster, your data will be concentrated on just a few machines - nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200MB of data into HBase and then wondering why your awesome 10-node cluster isn't doing anything.
On the other hand, high region count has been known to make things slow. This is getting better with each release of HBase, but it is probably better to have 700 regions than 3000 for the same amount of data.
There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc, held by the RegionServer.
When starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be largish (100k and up).
Question : You have created a schema called QuickTechie in your HBase, with the following tables: 1. USERPROFILES 2. USERARTICLES 3. (table name not available) 4. USERGROUPS 5. USERCOMMENT. How many regions will the RegionServers have? 1. It must be equal to five 2. Only one 3. (option text not available) 4. It would have at least five regions. Ans : 4

Exp : HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS, and it can handle petabytes of data. One of the interesting capabilities in HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large. The basic unit of horizontal scalability in HBase is called a Region. Regions are a subset of the table's data and they are essentially a contiguous, sorted range of rows that are stored together. Initially, there is only one region for a table. When regions become too large after adding more rows, the region is split into two at the middle key, creating two roughly equal halves.

In HBase the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served only by one Region Server. The HBase architecture has two main services: HMaster, which is responsible for coordinating the cluster and executing administrative operations, and HRegionServer, which is responsible for handling a subset of the table's data. HBase clusters expand by adding RegionServers that are hosted on industry-standard servers (often called commodity hardware). If a cluster expands with more RegionServers, it increases both in terms of storage and processing capacity. When you want to increase HBase write throughput, moving to a bigger cluster by adding RegionServers will improve write performance.

HMaster, Region Assignment, and Balancing: As previously mentioned, the HBase Master coordinates the HBase cluster and is responsible for administrative operations. A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup, and the master can decide to move a Region from one Region Server to another as the result of a load balance operation. The Master also handles Region Server failures by assigning the region to another Region Server.

The mapping of Regions and Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations the master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data. To put or get a row, clients don't have to contact the master; they can directly contact the Region Server that handles the specified row or, in the case of a client scan, the set of Region Servers responsible for handling the set of keys. To identify the Region Server, the client does a query on the META table. META is a system table used to keep track of regions. It contains the server name and a region identifier comprising a table name and the start row-key. By looking at the start-key and the next region's start-key, clients are able to identify the range of rows contained in a particular region. The client keeps a cache of the region locations. This avoids clients hitting the META table every time an operation on the same region is issued.
In case of a region split or move to another Region Server (due to balancing, or assignment policies), the client will receive an exception as a response and the cache will be refreshed by fetching the updated information from the META table. Since META is a table like the others, the client has to identify on which server META is located. The META locations are stored in a ZooKeeper node on assignment by the Master, and the client reads the node directly to get the address of the Region Server that contains META. HBase's original design was based on BigTable, with another table called -ROOT- containing the META locations and Apache ZooKeeper pointing to it. HBase 0.96 removed that arrangement in favor of ZooKeeper only, since META cannot be split and therefore consists of a single region.

Client API: Master and Regions Responsibilities: The HBase Java client API has two main interfaces. HBaseAdmin allows interaction with the "table schema" by creating/deleting/modifying tables, and it allows interaction with the cluster by assigning/unassigning regions, merging regions together, calling for a flush, and so on; this interface communicates with the Master. HTable allows the client to manipulate the data of a specified table by using get, put, delete, and all the other data operations; this interface communicates directly with the Region Servers responsible for handling the requested set of keys. Those two interfaces have separate responsibilities: HBaseAdmin is only used to execute admin operations and communicate with the Master, while HTable is used to manipulate data and communicate with the Regions.

Conclusion : As we've seen here, having a Master/Slave architecture does not mean that each operation goes through the master. To read and write data, the HBase client in fact goes directly to the specific Region Server responsible for handling the row keys for all the data operations (HTable). The Master is used by the client only for table creation, modification, and deletion operations (HBaseAdmin). Although the concept of a Master exists, the HBase client does not depend on it for data operations, and the cluster can keep serving data even if the master goes down. Each table has at least one region, until it splits. If you have five tables stored in HBase, you have at least five regions. You may have more depending on the size of the region and whether it has split, but you have at least five. HBase documentation on Region: Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family.
Question : HLog stores all the edits to the HStore; it is the HBase write-ahead log implementation. It performs logfile-rolling, so external callers are not aware that the underlying file is being rolled. There is one HLog per RegionServer, and all edits for all Regions carried by a particular RegionServer are entered first in the HLog. Select the order in which data is written to the HLog. 1. It can write in any order to give parallelization 2. One batch is first sorted in ascending order of row key to give read performance 3. (option text not available) 4. It maintains the same order as it is written. Ans : 4 Exp : HLog stores all the edits to the HStore. It is the HBase write-ahead log implementation. It performs logfile-rolling, so external callers are not aware that the underlying file is being rolled. There is one HLog per RegionServer. All edits for all Regions carried by a particular RegionServer are entered first in the HLog.
Each HRegion is identified by a unique long int. HRegions do not need to declare themselves before using the HLog; they simply include their HRegion-id in the append or completeCacheFlush calls. An HLog consists of multiple on-disk files, which have a chronological order. As data is flushed to other (better) on-disk structures, the log becomes obsolete. We can destroy all the log messages for a given HRegion-id up to the most-recent CACHEFLUSH message from that HRegion.
It's only practical to delete entire files. Thus, we delete an entire on-disk file F when all of the messages in F have a log-sequence-id that's older (smaller) than the most-recent CACHEFLUSH message for every HRegion that has a message in F. Synchronized methods can never execute in parallel. However, between the start of a cache flush and the completion point, appends are allowed but log rolling is not. To prevent log rolling taking place during this period, a separate reentrant lock is used.As all regions on a RegionServer share a single instance of HLog, data is written sequentially for the best I/O performance. It also means that there is overhead if you need to replay the log and restore data; however this happens infrequently enough to warrant sequential writes for the performance gains the cluster achieves. As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
Log splitting is done by HMaster as the cluster starts or by ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, affected regions are unavailable until data is restored. So we need to recover and replay all WAL edits before letting those regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.
Question : Your Acmeshell Android client application writes to HBase. Which sequence will your application traverse to find the Region serving the row range of interest? 1. ZooKeeper 2. .META. 3. RegionServer 4. Region. 1. 2,3,1,4 2. 4,2,3,1 3. 1,2,3,4 4. 1,2,4,3 Ans : 3

Exp : Catalog Tables: The catalog table hbase:meta exists as an HBase table and is filtered out of the HBase shell's list command, but is in fact a table just like any other.

-ROOT- : The -ROOT- table was removed in HBase 0.96.0; information here should be considered historical. The -ROOT- table kept track of the location of the .META. table (the previous name for the table now called hbase:meta) prior to HBase 0.96. The -ROOT- table structure was as follows: Key: .META. region key (.META.,,1). Values: info:regioninfo (serialized HRegionInfo instance of hbase:meta), info:server (server:port of the RegionServer holding hbase:meta), info:serverstartcode (start-time of the RegionServer process holding hbase:meta).

hbase:meta : The hbase:meta table (previously called .META.) keeps a list of all regions in the system. The location of hbase:meta was previously tracked within the -ROOT- table, but is now stored in ZooKeeper. The hbase:meta table structure is as follows: Key: region key of the format ([table],[region start key],[region id]). Values: info:regioninfo (serialized HRegionInfo instance for this region), info:server (server:port of the RegionServer containing this region), info:serverstartcode (start-time of the RegionServer process containing this region). When a table is in the process of splitting, two other columns will be created, called info:splitA and info:splitB. These columns represent the two daughter regions; the values for these columns are also serialized HRegionInfo instances. After the region has been split, this row will eventually be deleted.

Note on HRegionInfo: The empty key is used to denote table start and table end. A region with an empty start key is the first region in a table. If a region has both an empty start and an empty end key, it is the only region in the table. In the (hopefully unlikely) event that programmatic processing of catalog metadata is required, see the Writables utility.

Startup Sequencing: First, the location of hbase:meta is looked up in ZooKeeper. Next, hbase:meta is updated with server and startcode values. When you write to HBase for the first time, you need to find the region where your new row will be written. To locate the RegionServer, HBase keeps two catalog tables that most users will never have to touch, -ROOT- and .META.. .META. holds the location of the regions of all the tables, and -ROOT- holds the location of .META.. When a RegionServer crashes, you won't know where to find the region, so you look into a cached location for .META.; if you cannot find it in .META., you look into -ROOT-. This is the reason why HBase stores the location of -ROOT- in a znode in ZooKeeper. The correct order for pre-0.96 versions of HBase is ZooKeeper -> -ROOT- -> .META. -> RegionServer -> Region. The correct order for post-0.96 versions of HBase is ZooKeeper -> .META. -> RegionServer -> Region.
The general flow is that a new client contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from Zookeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for. Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.
Question : You have an Android-based client application. What information does the .META. table provide to the Acmeshell App? 1. It provides the pointer to -ROOT-
Ans : 4 Exp : The .META. table keeps a list of all regions in the system: the current location of each region for each table, and also its name, its HRegionInfo, and some server information. The catalog table details and the client lookup flow are the same as described in the explanation of the previous question.
Question : You are migrating data from an Oracle database to HBase with Put commands, and you observed that your migration is very slow. You contacted the HBase administrator and he suggested using the bulk load API instead of executing Put commands for bulk insert operations. Why? 1. There is a possibility of losing the data if one of the datanodes fails in Hadoop 2. The bulk load API will bypass the NameNode and directly write the data to the DataNode, later informing the NameNode via heartbeat 3. (option text not available) 4. Writes bypass the HLog and there is less load on the RegionServer
Ans : 4 Exp : HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API. Bulk Load Limitations : As bulk loading bypasses the write path, the WAL doesn't get written to as part of the process. Replication works by reading the WAL files so it won't see the bulk loaded data – and the same goes for the edits that use Put.setWriteToWAL(true). One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there. Bulk Load Architecture : The HBase bulk load process consists of two main steps. Preparing data via a MapReduce job : The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that they can be later loaded very efficiently into the cluster.
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table. HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table. Completing the data load : After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means. Importing the prepared data using the completebulkload tool: After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example: $ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase). Note: If the target table does not already exist in HBase, this tool will create the table automatically. This tool will run quickly, after which point the new data will be visible in the cluster. This question is specific to bulk loading with the HBase bulk load API, not using external tools. The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that it can be later loaded very efficiently into the cluster. In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region.
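A hedged two-step sketch of the bulk load path described above, assuming the pre-1.0 mapreduce API (the job object, table name "mytable", and output path are illustrative; imports from org.apache.hadoop.hbase.mapreduce are assumed):

// Step 1: in the MapReduce driver, write HFiles partitioned to match region boundaries
HTable table = new HTable(conf, "mytable");
HFileOutputFormat.configureIncrementalLoad(job, table); // wires in TotalOrderPartitioner + HFileOutputFormat

// Step 2: after the job completes, adopt the HFiles into the running cluster
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/user/todd/myoutput"), table);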
Question : You have an Acmeshell advertising application which connects to HBase and inserts user comments into the AcmeLogs table, and the table uses a random number as the rowkey. Later you learn that the random-number rowkey has no business use, and you decide to use the email id as the rowkey instead. Select the correct statement. 1. Using an HBase shell command you can change the rowkeys
2. Simply ALTER the TABLE and change the rowkeys 3. (option text not available) 4. You have decided to keep both tables until the complete migration happens, and you modify the client code so it now writes to both the old table and a new table while migrating the old data in the background. Ans : 4 Exp : In HBase, rowkeys are immutable. If you change the row key, you have to rewrite the entire table. Typically, you create a new table with new rowkeys and migrate the old data into the new table in the background via a map-reduce job or tools. Monotonically Increasing Row Keys/Timeseries Data
In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so that they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table. Try to minimize row and column sizes, or: why are my StoreFile indices large? In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (Section 9.7.6.4, "StoreFile (HFile)") to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc in the above-cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread "a question storefileIndexSize" on the user mailing list.
Most of the time small inefficiencies don't matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.
Question : You have a USERPROFILE table in HBase. The row keys are dates. You would like to retrieve all rows between 01012012 and 31122012 (1 year of data). Select the proper command for this operation. 1. scan 'USERPROFILE', {STARTROW => '01011990', STOPROW => '31122012'} 2. get 'USERPROFILE', {STARTROW => '01011990', STOPROW => '31122012'}
3. (option text not available) 4. scan 'USERPROFILE', {STARTROW => '01011990', LIMIT => '31122012'} Ans : 1 Exp : HBase gives you two classes to read data: Get and Scan. The Get class reads data by specifying a single row key and the Scan class supports a range scan. In the HBase Shell, you can type scan commands to retrieve a range of rows. All operations are identical to get commands with the exception of instantiation: rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the scan will iterate over all rows. Scan a table; pass the table name and optionally a dictionary of scanner specifications. Scanner specifications may include one or more of the following: LIMIT, STARTROW, STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will be scanned. To scan all members of a column family, leave the qualifier empty as in 'col_family:'. Examples:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

For experts, there is an additional option -- CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off (false). By default it is enabled. Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

One interesting piece of information not mentioned there is that version and column delete markers are ordered in line together with the KeyValues that they affect, and family delete markers are always sorted to the beginning of their row. Generally each column family is represented by a Store, which manages one or more StoreFiles. Scanning is a form of merge-sort performed by a RegionScanner, which merges results of one or more StoreScanners (one per family), who in turn merge results of one or more StoreFileScanners (one for each file for this family):

                        RegionScanner
                       /             \
          StoreScanner               StoreScanner
         /            \             /            \
StoreFileScanner StoreFileScanner StoreFileScanner StoreFileScanner
       |                |                |                |
   StoreFile        StoreFile        StoreFile        StoreFile

Say you performed the following actions (T is time):
put: row1, family, col1, value1, T
delete family: row1, family, T+1
put: row1, family, col1, value2, T+2
delete columns: row1, family, col1, T+3
put: row1, family, col1, value3, T+4

What we will find in the StoreFile for "family" is this: family-delete row1, T+1; row1,col1,value3, T+4; column-delete row1,col1, T+3; row1,col1,value2, T+2; row1,col1,value1, T. KeyValues are ordered in reverse chronological order (within their row and column). The family delete marker, however, is always first in the row. That makes sense, because the family delete marker potentially affects many columns in this row, so in order to allow scanners to scan forward-only, the family delete markers need to be seen by a scanner first. That also means that even if we are only looking for a specific column, we always seek to the beginning of the row to check if there is a family delete with a timestamp that is greater or equal to the versions of the column that we are interested in. After that the scanner seeks to the column. And even if we are looking for a past version of a column, we have to seek to the "beginning" of the column (i.e. the potential delete marker right before it) before we can scan forward to the version we're interested in. My initial patch for HBASE-5268 would sort the prefix delete markers just like column delete markers. By now it should be obvious why this does not work.
The beginning of a row is a known point, so it the "beginning" of a column. The beginning of a prefix of a column is not. So to find out whether a column is marked for deletion we would have to start at the beginning of the row and then scan forward to find all prefix delete markers. That clearly is not efficient. My 2nd attempt placed the all prefix delete markers at the beginning of the row. That technically works. But notice that a column delete marker only has to be retained by the scanner for a short period of time (until after we scanned past all versions that it affects). For prefix delete markers we'd have to keep them into memory until we scanned past all columns that start with the prefix. In addition the number of prefix delete markers for a row is not principally limited. Family delete markers do not have this problem because (1) the number of column families is limited for other reasons and (2) store files are per family, so all we have to remember for a family in a StoreScanner is a timestamp.
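For reference, the same range scan can be issued from the Java client API. This is a minimal sketch only; the table name and key values simply mirror the question above, and note that the stop row is exclusive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "USERPROFILE");
        // Inclusive start row, exclusive stop row -- the shell's STARTROW/STOPROW.
        Scan scan = new Scan(Bytes.toBytes("01012012"), Bytes.toBytes("31122012"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();  // scanners hold server-side resources; always close them
            table.close();
        }
    }
}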
Question : What will increasing the value of the hbase.hregion.max.filesize setting affect? 1. Automated region splitting
2. DataStore volume capacity will be increased 3. Access Mostly Uused Products by 50000+ Subscribers 4. Block size will be increased
Ans : 1
Exp : Regardless of whether pre-splitting is used or not, once a region gets to a certain limit, it is automatically split into two regions. If you are using HBase 0.94 (which comes with HDP-1.2), you can configure when HBase decides to split a region, and how it calculates the split points, via the pluggable RegionSplitPolicy API. There are a few predefined region split policies: ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, and KeyPrefixRegionSplitPolicy.

The first one is the default and only split policy for HBase versions before 0.94. It splits a region when the total data size for one of the stores (corresponding to a column family) in the region gets bigger than the configured hbase.hregion.max.filesize, which has a default value of 10GB. This split policy is ideal in cases where you have done pre-splitting and are interested in a lower number of regions per region server.

The default split policy for HBase 0.94 and trunk is IncreasingToUpperBoundRegionSplitPolicy, which does more aggressive splitting based on the number of regions hosted on the same region server. The split policy uses a max store file size of Min(R^2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table hosted on the same region server. So, for example, with the default memstore flush size of 128MB and the default max store size of 10GB, the first region on the region server will be split just after the first flush at 128MB. As the number of regions hosted on the region server increases, it will use increasing split sizes: 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, etc. After reaching 9 regions, the split size goes beyond the configured hbase.hregion.max.filesize, at which point the 10GB split size is used from then on. For both of these algorithms, regardless of when splitting occurs, the split point used is the rowkey that corresponds to the midpoint in the "block index" for the largest store file in the largest store.

KeyPrefixRegionSplitPolicy is a curious addition to the HBase arsenal. You can configure the length of the prefix by which your row keys are grouped, and this split policy ensures that regions are not split in the middle of a group of rows having the same prefix. If you have set prefixes for your keys, then you can use this split policy to ensure that rows having the same rowkey prefix always end up in the same region. This grouping of records is sometimes referred to as "Entity Groups" or "Row Groups", and it is a key feature when considering use of the "local transactions" feature in your application design. You can configure the default split policy by setting the configuration "hbase.regionserver.region.split.policy", or by configuring the table descriptor.

Bigger Regions : Consider going to larger regions to cut down on the total number of regions on your cluster. Generally, fewer regions to manage makes for a smoother-running cluster. (You can always manually split the big regions later should one prove hot and you want to spread the request load over the cluster.) A lower number of regions is preferred, generally in the range of 20 to low hundreds per RegionServer. Adjust the region size as appropriate to achieve this number. For the 0.90.x codebase, the upper bound of region size is about 4GB, with a default of 256MB. For the 0.92.x codebase, due to the HFile v2 change, much larger region sizes can be supported (e.g., 20GB). You may need to experiment with this setting based on your hardware configuration and application needs. Adjust hbase.hregion.max.filesize in your hbase-site.xml. Region size can also be set on a per-table basis via HTableDescriptor.

Managed Splitting : Rather than let HBase auto-split your regions, you can manage the splitting manually. With growing amounts of data, splits will continually be needed. Since you always know exactly what regions you have, long-term debugging and profiling are much easier with manual splits; it is hard to trace the logs to understand region-level problems if regions keep splitting and getting renamed. Data offlining bugs + unknown number of split regions == oh crap! If an HLog or StoreFile was mistakenly unprocessed by HBase due to a weird bug and you notice it a day or so later, you can be assured that the regions specified in these files are the same as the current regions, and you have fewer headaches trying to restore/replay your data. You can also finely tune your compaction algorithm: with roughly uniform data growth, it is easy to cause split/compaction storms as the regions all hit roughly the same data size at the same time; with manual splits, you can let staggered, time-based major compactions spread out your network IO load.

How do I turn off automatic splitting? Automatic splitting is determined by the configuration value hbase.hregion.max.filesize. It is not recommended that you set this to Long.MAX_VALUE, in case you forget about manual splits. A suggested setting is 100GB, which would result in > 1hr major compactions if reached.
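To make the IncreasingToUpperBoundRegionSplitPolicy arithmetic above concrete, here is a small sketch (using the default 128MB flush size and 10GB max file size from the explanation) that reproduces the 128MB, 512MB, 1152MB, ... progression:

public class SplitSizeProgression {
    public static void main(String[] args) {
        long flushSize = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size default
        long maxFileSize = 10L * 1024 * 1024 * 1024;  // hbase.hregion.max.filesize default
        for (int r = 1; r <= 10; r++) {
            // Min(R^2 * flush size, max file size), where R is the number of
            // regions of the table hosted on this region server.
            long splitSize = Math.min((long) r * r * flushSize, maxFileSize);
            System.out.printf("R=%d -> split at %d MB%n", r, splitSize / (1024 * 1024));
        }
    }
}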
Question : In a ZooKeeper ensemble, how many simultaneous ZooKeeper node failures can be handled? 1. 1 2. 2 3. Access Mostly Uused Products by 50000+ Subscribers 4. 4
Ans : 3
Exp : For a reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble is up, the service will be available. Because ZooKeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.

A distributed Apache HBase installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you: it starts and stops the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independently of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.

When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native zoo.cfg file, or, the easier option, specify ZooKeeper options directly in conf/hbase-site.xml. A ZooKeeper configuration option can be set as a property in the hbase-site.xml configuration file by prefacing the ZooKeeper option name with hbase.zookeeper.property. For example, the clientPort setting in ZooKeeper can be changed by setting the hbase.zookeeper.property.clientPort property. For all default values used by HBase, including ZooKeeper configuration, see HBase Default Configuration and look for the hbase.zookeeper.property prefix. You must at least list the ensemble servers in hbase-site.xml using the hbase.zookeeper.quorum property. This property defaults to a single ensemble member at localhost, which is not suitable for a fully distributed HBase (it binds to the local machine only, and remote clients will not be able to connect).

How many ZooKeepers should I run? You can run a ZooKeeper ensemble that comprises only one node, but in production it is recommended that you run an ensemble of 3, 5, or 7 machines; the more members an ensemble has, the more tolerant the ensemble is of host failures. Further, you should run an odd number of machines. An even number of peers is supported, but it is normally not used, because an even-sized ensemble requires proportionally more peers to form a quorum than an odd-sized ensemble does. For example, an ensemble of 4 peers requires 3 to form a quorum, while an ensemble of 5 also requires only 3. Thus, an ensemble of 5 allows 2 peers to fail and is more fault tolerant than an ensemble of 4, which allows only 1 down peer. In order to form a proper quorum you need at least 3 servers; a ZooKeeper ensemble of 5 allows 2 peers to fail.

It is critical that you run ZooKeeper under supervision, since ZooKeeper is fail-fast and will exit the process if it encounters any error case. The ZooKeeper data directory contains files which are a persistent copy of the znodes stored by a particular serving ensemble: its snapshot files. As changes are made to the znodes, these changes are appended to a transaction log; occasionally, when a log grows large, a snapshot of the current state of all znodes is written to the filesystem. This snapshot supersedes all previous logs. ZooKeeper's transaction log should be on a dedicated device (a dedicated partition is not enough): ZooKeeper writes the log sequentially, without seeking, and sharing the log device with other processes can cause seeks and contention, which in turn can cause multi-second delays. Also, do not put ZooKeeper in a situation where it can swap; for ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to it.
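As a minimal illustration of the properties described above, an externally managed three-node ensemble might be declared in conf/hbase-site.xml as follows (the host names zk1/zk2/zk3 are placeholders, and 2181 is ZooKeeper's usual client port):

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <!-- ZooKeeper's clientPort, set via the hbase.zookeeper.property prefix -->
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>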
Question : When do you get an error with the message "Too many open files" (java.io.IOException) while writing to HBase? 1. When available memory is very low and HBase is not able to open a data file. 2. When there is not enough space on one of the datanodes 3. Access Mostly Uused Products by 50000+ Subscribers 4. The system does not allow more than 10 files for read and write in case of HBase
Ans : 3
Exp : Limits on the number of files and processes (ulimit): Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X). You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase. See Section 15.9.2.2, java.io.IOException... (Too many open files) for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

It is recommended to raise the ulimit to at least 10,000, but more likely 10,240, because the value is usually expressed in multiples of 1024. Each ColumnFamily has at least one StoreFile, and possibly more than six StoreFiles if the region is under load. The number of open files required depends upon the number of ColumnFamilies and the number of regions. The following is a rough formula for calculating the potential number of open files on a RegionServer:

(StoreFiles per ColumnFamily) x (ColumnFamilies per region) x (regions per RegionServer)

For example, assuming that a schema has 3 ColumnFamilies per region with an average of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration files, and others. Opening a file does not take many resources, and the risk of allowing a user to open too many files is minimal.

Another related setting is the number of processes a user is allowed to run at once. In Linux and Unix, the number of processes is set using the ulimit -u command. This should not be confused with the nproc command, which controls the number of CPUs available to a given user. Under load, an nproc that is too low can cause OutOfMemoryError exceptions. See Jack Levin's "major hdfs issues" thread on the hbase-users mailing list from 2011.

Configuring the maximum number of file descriptors and processes for the user who is running the HBase process is an operating system configuration, rather than an HBase configuration. It is also important to be sure that the settings are changed for the user that actually runs HBase. To see which user started HBase, and that user's ulimit configuration, look at the first line of the HBase log for that instance.

ulimit settings on Ubuntu: To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf, which is a space-delimited file with four columns. Refer to the man page for limits.conf for details about the format of this file. In the following example, the first line sets both soft and hard limits for the number of open files (nofile) to 32768 for the operating system user with the username hadoop. The second line sets the number of processes to 32000 for the same user.

hadoop - nofile 32768
hadoop - nproc 32000

The settings are only applied if the Pluggable Authentication Module (PAM) environment is directed to use them. To configure PAM to use these limits, be sure that the /etc/pam.d/common-session file contains the following line:

session required pam_limits.so

This exception states that the system can't create a new file because there are too many open files. You can increase the maximum number of open files in the system with the ulimit command.
Question : Which action will help to improve random read performance? 1. Increase the blocksize
2. Decrease the blocksize 3. Access Mostly Uused Products by 50000+ Subscribers 4. Increase the number of partitions in the table
Ans : 2
Exp : Decreasing the blocksize will improve random reads of small cell values (e.g., around 100 bytes). ColumnFamilies can optionally be defined as in-memory; data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the block cache, but this is not a guarantee that the entire table will be in memory. ColumnFamily BlockSize : The blocksize can be configured for each ColumnFamily in a table, and it defaults to 64KB. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled, the resulting indexes should be roughly halved). Blocks are used for different things in HDFS and HBase: blocks in HDFS are the unit of storage on disk, while blocks in HBase are a unit of storage for memory. Many HBase blocks fit into a single HBase file. HBase is designed to maximize efficiency on top of the HDFS file system and fully uses the block size there; some people have even tuned their HDFS to 20GB block sizes to make HBase more efficient. If you have perfectly random access on a table that is much larger than memory, then the HBase cache will not help you. However, since HBase is intelligent in how it stores and retrieves data, it does not need to read an entire file block from HDFS to get at the data needed for a request: data is indexed by key and is efficient to retrieve. Additionally, if you have designed your keys well to distribute data across your cluster, random reads will read equally from every server, so overall throughput is maximized.
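For illustration, the per-ColumnFamily blocksize can be set when creating a table through the Java admin API. This is only a sketch; the table and family names are made up, and 16KB is just an example value below the 64KB default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class BlocksizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("USERPROFILE");
        HColumnDescriptor family = new HColumnDescriptor("profile");
        family.setBlocksize(16 * 1024);  // smaller blocks favor random reads of small cells
        family.setInMemory(true);        // give this family block-cache priority
        desc.addFamily(family);

        admin.createTable(desc);
        admin.close();
    }
}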
Question :
Please select the correct statement(s) for HBase
1. HBase is a distributed database 2. HBase is a Hadoop application 3. Access Mostly Uused Products by 50000+ Subscribers 4. HBase should be used when you require real-time read or write random access to data 5. 1, 2 and 4 are correct Ans : 5 Exp : HBase is a distributed, column-oriented database built on top of HDFS. It is a Hadoop application, which can be used when the requirement is real-time read/write random access to very large datasets.
Question : HBase is not a relational database but supports SQL
1. True 2. False Ans : 2
Exp : HBase is not a relational database, and it does not support SQL.
Question :
In HBase data cells, how is versioning implemented?
1. HBase creates a unique cell ID based on some numeric and character data 2. Versioning is done based on a hash key of the cell value 3. Access Mostly Uused Products by 50000+ Subscribers 4. There is no versioning support in HBase
Ans : 3 Exp : HBase versions cell values; versioning is based on the current timestamp, which is auto-assigned by HBase at the time of cell insertion.
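To make timestamp-based versioning concrete, here is a minimal sketch with the Java client API (the table, family, and qualifier names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersioningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");

        // HBase stamps the new cell with the current time unless a timestamp is supplied.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("salary"), Bytes.toBytes("5500"));
        table.put(put);

        // A read can ask for several timestamped versions of the same cell.
        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);
        Result result = table.get(get);
        System.out.println(result);
        table.close();
    }
}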
Question :
In HBase, a cell's content is an uninterpreted array of bytes
1. True 2. False
Ans : 1
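Exp : Because HBase stores cell contents as plain byte arrays, serialization and deserialization are the client's job. A minimal sketch with the client library's Bytes utility:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesExample {
    public static void main(String[] args) {
        // HBase does not interpret cell contents; clients convert to and from bytes.
        byte[] raw = Bytes.toBytes(42L);        // serialize a long
        long back = Bytes.toLong(raw);          // deserialize it again
        System.out.println(back);               // 42

        byte[] text = Bytes.toBytes("hello");   // strings work the same way
        System.out.println(Bytes.toString(text));
    }
}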
Question :
Table rows are sorted by the table's primary key
1. True 2. False Ans : 1
Exp : Table rows are sorted by row key, the table's primary key, and the sorting is byte-ordered. All table access is via the table's primary key.
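One practical consequence of byte ordering: row keys sort lexicographically, not numerically, so "row10" sorts before "row2". A tiny sketch (plain string sorting mirrors HBase's byte order for ASCII keys):

import java.util.Arrays;

public class ByteOrderExample {
    public static void main(String[] args) {
        String[] keys = {"row1", "row10", "row2"};
        Arrays.sort(keys);  // lexicographic, like HBase's byte ordering
        System.out.println(Arrays.toString(keys));  // [row1, row10, row2]
        // Zero-padding the numeric part (row01, row02, row10) restores numeric order.
    }
}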
Question : You will use the MapR-DB JSON-based document database model for _________________ data formats. A. hierarchical B. nested C. varying D. evolving 1. A,B,C 2. B,C,D 3. Access Mostly Uused Products by 50000+ Subscribers 4. A,B,C,D 5. A,C,E