Question : Which client interface to HBase can you use so that you have all the functionality needed to store and retrieve data from HBase, as well as delete obsolete values?
Explanation: HTable is the class in the org.apache.hadoop.hbase.client package which provides the user with all the functionality needed to store and retrieve data from HBase, as well as delete obsolete values and so on.
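For illustration, a minimal sketch of that usage against the older (pre-1.0) client API this question refers to; the table name "employee" and the column coordinates are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "employee");   // hypothetical table name
    // store
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("john"));
    table.put(put);
    // retrieve
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    // delete obsolete values
    table.delete(new Delete(Bytes.toBytes("row1")));
    table.close();
  }
}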
Question : Which of the following statements is/are correct?
A. All operations that mutate data are guaranteed to be atomic on a per-row basis B. A reading client will not be affected by another updating a particular row C. Many clients can update the same row at the same time 1. A,C correct 2. A,B correct 3. (option text not available) 4. All A,B,C are correct
Answer with explanation:
2 is the best answer.
In HBase, operations that mutate data are guaranteed to be atomic on a per-row basis. This affects all other concurrent readers and writers of that same row.
However, having many clients update the same row at the same time should be avoided.
Question : Select the correct statement..
A. Create HTable instances only once, usually when your application starts B. Create a separate HTable instance for every thread you execute, or use HTablePool C. Updates are atomic on a per-row basis
Creating HTable instances is costly: it involves scanning the .META. table to check whether the table actually exists and whether it is enabled, as well as a few other operations that make this call quite expensive.
Hence it is recommended that you create HTable instances only once, usually one per thread, and reuse that instance for the rest of the lifetime of your client application, as the sketch below illustrates.
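A minimal sketch of the pooled alternative from option B above, assuming the pre-1.0 client API (pool size and table name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;

public class PoolSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTablePool pool = new HTablePool(conf, 10);        // at most 10 cached HTable instances
    HTableInterface table = pool.getTable("employee"); // hypothetical table name
    try {
      // ... puts/gets on this thread ...
    } finally {
      table.close(); // in 0.92+ this returns the instance to the pool rather than destroying it
    }
    pool.close();
  }
}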
1. You cannot add a new column to an existing column family. 2. You need to empty the column family, add the new column, and repopulate the data. 3. New columns can be added to an existing column family on demand. 4. None of the above are correct Ans : 3
Exp : A table's column families must be specified upfront as part of table schema definition. However, new column family members can be added on demand.
Question :
Physically, all column family members are stored separately if they have more than three columns?
1. True 2. False Ans : 2
Exp : Physically all column family members are stored together on the file system.
Question :
Which is the correct statement in the case of the HBase database? 1. In HBase, table rows are sorted 2. Columns can be added on the fly if the column family pre-exists 3. (option text not available) 4. All of the above Ans : 4
Question : Which of the following is valid syntax to represent a single column family?
Ans : 1 Exp : Row columns are grouped into column families. All column family members have a common prefix; for example, the columns employee:name, employee:salary, and employee:age are all members of the employee column family. Options 2 and 3 are not valid syntax.
Question : Initially a table comprises two regions: one starts filling first, and the second is kept in reservation.
1. True 2. False Ans : 2
Exp : Initially a table comprises a single region, but as the size of the region grows and crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size.
Question : Which one is the correct statement for HBase regions? 1. Tables are automatically partitioned horizontally by HBase into regions 2. Regions are the unit that gets distributed over an HBase cluster 3. (option text not available)
1. 1 and 2 are correct 2. 2 and 3 are correct 3. 1 and 3 are correct 4. 1, 2 and 3 are correct
Ans : 4
Question : There is an employee column family; the current value in row one is (john,$,) : (name,salary,age). Now one application is updating the salary to $5500 and at the same moment application two is fetching the salary. What value will be returned to application two?
Question : The completebulkload tool is used to import data into a running cluster.
1. True 2. False Ans : 1
Question : You have one primary HMaster and one standby. Your primary HMaster fails and your client application needs to make a metadata change. Which of the following is the effect on your client application?
1. The client will query ZooKeeper to find the location of the new HMaster and complete the metadata change. 2. The client will make the metadata change regardless of the state of the HMaster. 3. (option text not available) 4. The client application will fail with a runtime error. Ans : 1
Exp: The HBase master publishes its location to clients via ZooKeeper. This is done to support multi-master operation (failover). So if the HBase master self-discovers its location as a localhost address, then it will publish that. Region servers or clients which go to ZooKeeper for the master location will get back an address that, in that case, is only useful if they happen to be co-located with the master. Note: * HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
Question : You have an average key-value pair size of bytes. Your primary access is random reads on the table. Which of the following actions will speed up random read performance on your cluster?
Exp : Larger block size is preferred if files are primarily for sequential access. Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create
Question : The cells in a given row have versions that range from to . You execute a delete specifying the value 3000 for the version. What is the outcome?
Exp : When performing a delete operation in HBase, there are two ways to specify the versions to be deleted
Delete all versions older than a certain timestamp
Delete the version at a specific timestamp
A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete explicit versions. For the deletion of a row or all the columns within a family, it always works by deleting all cells older than a certain version.
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is "delete all cells where the version is less than or equal to this version". HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[17]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted
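As a hedged illustration, the pre-1.0 Delete API exposes both flavors described above (row, family, and column names are hypothetical; conf/table as in the first sketch):

Delete d = new Delete(Bytes.toBytes("row1"));
// tombstone all versions of the cell with a timestamp less than or equal to 3000
d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("col"), 3000L);
table.delete(d);

Delete d2 = new Delete(Bytes.toBytes("row1"));
// delete only the single version stored at exactly timestamp 3000
d2.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), 3000L);
table.delete(d2);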
Question : You have an "Employees" table in HBase. The Row Keys are the employees' IDs. You would like to retrieve all employees who have an employee ID between 'user_100' and 'user_110'. The shell command you would use to complete this is:
Exp : public Scan(byte[] startRow, byte[] stopRow) creates a Scan operation for the range of rows specified. Parameters: startRow - row to start scanner at or after (inclusive); stopRow - row to stop scanner before (exclusive). Reference: org.apache.hadoop.hbase.client, Class Scan
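A hedged Java equivalent of that range scan for 'user_100' through 'user_110', reusing the table from the first sketch (since stopRow is exclusive, a trailing zero byte is appended so that 'user_110' itself is included):

// stopRow is exclusive, so "user_110\0" makes the range include 'user_110'
Scan scan = new Scan(Bytes.toBytes("user_100"), Bytes.toBytes("user_110\0"));
ResultScanner scanner = table.getScanner(scan);
for (Result row : scanner) {
  System.out.println(Bytes.toString(row.getRow())); // each matching employee ID
}
scanner.close();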
Question : Under default settings, which feature of HBase ensures that data won't be lost in the event of a RegionServer failure?
1. All HBase activity is written to the WAL, which is stored in HDFS 2. All operations are logged on the HMaster 3. (option text not available) 4. Data is stored on the local filesystem of the RegionServer. Ans : 1 Exp : HBase data updates are stored in a place in memory called the memstore for fast writes. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
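A related, hedged sketch: the pre-1.0 client's Put.setWriteToWAL (referenced later in this document's bulk-load discussion) lets a client trade this durability for write speed; row and column names here are hypothetical:

Put p = new Put(Bytes.toBytes("row1"));
p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
p.setWriteToWAL(false); // skips the WAL: faster, but memstore edits are lost on RegionServer failure
table.put(p);           // the default is setWriteToWAL(true), which gives the durability described above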
Question : You want to store clickstream data in HBase. Your data consists of the following: the source id, the name of the cluster, the URL of the click, the timestamp for each click
Which rowkey would you use if you wanted to retrieve the source ids with a scan and sorted with the most recent first?
Ans : 3 The HBase master server creates the ZooKeeper znode /hbase. This is then used by the HBase daemons to coordinate; even the name of the active HBase master is stored here. If the HBase master dies, the backup HBase master overwrites the contents of the znode so clients and region servers know about the new master. Apart from this, region info is maintained in ZooKeeper znodes as well. * The multi-master feature introduced in 0.20.0 does not add cooperating Masters; there is still just one working Master while the other backups wait. For example, if you start 200 Masters, only 1 will be active while the others wait for it to die. The switch usually takes zookeeper.session.timeout plus a couple of seconds to occur.
you can just start up a second Master on some other machine and it automatically becomes the backup master. This functionality is built in to HBase and the Zookeeper quorum takes care of leader election. Basically the first HBase master to start up becomes the "Active" master and any other masters that you start up become "Backup". Should your active master die, ZK will detect this and make your backup master become the active one. This could take as long as the ZK session timeout to happen, plus some time for the backup master to do some housekeeping, but all the data that a Master operates on is persisted in ZK and HDFS, so this failover is seamless.
/hbase/master (zookeeper.znode.master) The "active" master will register its own address in this znode at startup, making this znode the source of truth for identifying which server is the Master.
/hbase/backup-masters (zookeeper.znode.backup.masters) Each inactive Master will register itself as backup Master by creating a sub-znode (hbase/backup-master/m1.host). This znode is mainly used to track which machines are available to replace the Master in case of failure.
Question : Data is written to the HLog in which of the following orders?
Question : Your application needs to retrieve non-sequential rows from a table with one billion rows. You know the rowkey of each of the rows you need to retrieve. What does your application need to implement?
Question : You have a table with the following rowkeys: r1, r2, r3, r10, r15, r20, r25, r30, r35 In which order will these rows be retrieved from a scan?
Exp : If you can have the table receiving rows always in decreasing order of the row keys, you then have easy access to the first and last rows. This is possible because HBase tables are always sorted by row key.
Question : You need to create a "WebLogs" table in HBase. The table will consist of a single Column Family called "Errors" and two column qualifiers, "IP" and "URL". The shell command you should use to create the table is:
Ans : 3 Columns in Apache HBase are grouped into column families.
All column members of a column family have the same prefix.
For example, the columns courses:history and courses:math are both members of the courses column family.
The colon character (:) delimits the column family from the column qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running. Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
Question : Which feature of HBase ensures predictable disk head seek performance within a RegionServer?
1. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is equal to newrow. Then return "true". 2. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is NOT equal to newrow. Then return "true". 3. (option text not available) 4. Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return "true". Ans : 4
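The options above describe HTable.checkAndPut; a minimal hedged sketch, reusing the conf/table setup from the first sketch (row, family, qualifier, and values are hypothetical):

Put newrow = new Put(Bytes.toBytes("rowkey"));
newrow.add(Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"), Bytes.toBytes("newvalue"));
boolean applied = table.checkAndPut(
    Bytes.toBytes("rowkey"), Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"),
    Bytes.toBytes("barvalue"), // expected current cell value
    newrow);                   // the put is applied atomically only if the check passes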
Question : Given that following is your entire dataset:
Exp : There are two columns families (Managers and Skills) so there will be two files. Note: * Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
* HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by changing flushing and compaction to work on a per column family basis).
* When changes are made to either Tables or ColumnFamilies (e.g., region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.
* StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. Compression happens at the block level within StoreFiles.
Question : Your client application is writing data to a Region. By default, where is the data saved first?
Exp : HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet.
Question : You want to do a full table scan on your data. You decide to disable block caching to see if this improves scan performance. Will disabling block caching improve scan performance?
1. No. Disabling block caching does not improve scan performance. 2. Yes. When you disable block caching, you free up that memory for other operations. With a full table scan, you cannot take advantage of block caching anyway because your entire table won't fit into cache. 3. (option text not available) 4. Yes. When you disable block caching, you free up memory for the MemStore, which improves scan performance Ans : 3
Exp : Disabling Blockcache: Do not turn off the block cache (you'd do it by setting hbase.block.cache.size to zero). Currently we do not do well if you do this, because the regionserver will spend all its time loading hfile indices over and over again. If your working set is such that the block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea of the size you need by surveying regionserver UIs; you'll see index block size accounted near the top of the webpage).
Question : Your client application needs to scan a region for the row key value . Given a store that contains the following list of Row Key values: 100, 101, 102, 103, 104, 105, 106, 107 A bloom filter would return which of the following?
* When a HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine if a given key is in that store file.
* Get/Scan(Row) currently does a parallel N-way get of that Row from all StoreFiles in a Region. This means that you are doing N read requests from disk. BloomFilters provide a lightweight in-memory structure to reduce those N disk reads to only the files likely to contain that Row (N-B).
* Keep in mind that HBase only has a block index per file, which is rather coarse-grained and tells the reader that a key may be in the file because it falls into a start and end key range in the block index. But whether the key is actually present can only be determined by loading that block and scanning it. This also places a burden on the block cache, and you may create a lot of unnecessary churn that the bloom filters would help avoid.
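A hedged sketch of turning on a Bloom filter for a column family (the family name is hypothetical, and the BloomType class moved packages between releases; shown here as in 0.96+):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType; // in 0.94 this was StoreFile.BloomType

HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family
cf.setBloomFilterType(BloomType.ROW);               // BloomType.ROWCOL additionally checks the column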
Question : You have Web servers producing timeseries data from Web traffic logs. You want to attain high write throughput for storing this data in an HBase table. Which of these should you choose for a row key to maximize your write throughput?
Ans : 1 Exp : In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so that they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
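One common mitigation alluded to above is salting plus a reversed timestamp; a hedged sketch (the bucket count and key layout are illustrative, not a prescribed format):

String sourceId = "web-42";                      // hypothetical source
long ts = System.currentTimeMillis();
int bucket = Math.abs(sourceId.hashCode()) % 16; // 16 salt buckets spread writes across regions
// the reversed timestamp (Long.MAX_VALUE - ts) sorts the most recent rows first within a source id
String rowkey = String.format("%02d|%s|%019d", bucket, sourceId, Long.MAX_VALUE - ts);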
Question : What is the advantage of using the bulk load API over doing individual Puts for bulk insert operations?
1. Writes bypass the HLog/MemStore, reducing load on the RegionServer. 2. Users doing bulk writes may disable writing to the WAL, which results in possible data loss. 3. (option text not available) 4. HFiles written out via the bulk load API are more space efficient than those written out of RegionServers. Ans : 1
Question : You have data already stored in HDFS and are considering using HBase. Which additional feature does HBase provide to HDFS?
Question : You need to free up disk space on your HBase cluster. You delete all versions of your data that is older than one week. You notice your delete has had minimal impact on your storage availability. This is because:
Exp : The actual deletion of the excess versions is done upon major compaction.
HBase basically never overwrites data but only appends. The data files are rewritten once in a while by a compaction process. A data file is basically a list of key-value pairs, where the key is the composite {row key, column key, time}. Each time you do a put that writes a new value for an existing cell, a new key-value pair gets appended to the store, even if you specify an existing timestamp. Doing lots of updates to the same row in a short time span will lead to a lot of key-value pairs being present in the store. Depending on the garbage collection settings (see next), these will be removed during the next compaction.
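Since excess versions are only physically dropped at major compaction, an administrator can request one explicitly; a hedged sketch using the pre-1.0 HBaseAdmin API (table name hypothetical, conf as in the earlier sketches):

import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(conf);
admin.majorCompact("employee"); // asynchronously requests a major compaction, which drops tombstoned and excess versions
admin.close();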
Question : Given the following HBase table schema: Row Key, colFam_A:a, colFam_A:b, colFamB:2, colFam_B:10 A table scan will return the column data in which of the following sorted orders:
Exp : Everything in HBase is sorted: first by row (row key), then by column family, followed by column qualifier, type, and finally timestamp (timestamps are sorted in reverse, so you see the newest records first).
Question : Your HBase cluster has hit a performance wall and doesn't seem to be getting faster as you add RegionServers. Adding an additional HMaster will:
1. Have no effect on performance. 2. Improve the performance of region writes but decrease the performance of metadata changes 3. (option text not available) 4. Make the performance problem even worse, as operations will have to be replicated to multiple masters.
Ans : 1
Exp : You can add multiple HBase master nodes; however, only one HBase master node is active at a time. The active HBase master node changes only when the current active HBase master node is shut down or fails.
HBase clusters expand by adding RegionServers that are hosted on industry-standard servers. If a cluster expands with more RegionServers, it increases both in terms of storage and processing capacity. When you want to increase HBase write throughput, moving to a bigger cluster by adding RegionServers will improve write performance.
Question : You have two tables in existing RDBMS. One contains information about the products you sell (name, size, color, etc.) The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. How would you design the schema?
Ans : 3 Exp : Access patterns are an important factor in HBase schema design. Even though the two tables in this scenario have very different data sizes and formats, it is better to store them in one table if you are accessing them together most of the time.
Column families allow for separation of data. You can store different types of data and formats in different column families. Attributes such as compression, Bloom filters, and replication are set on a per-column-family basis. In this example, it is better to store product information and product images in two different column families of one table.
HBase Documentation on Column Family: (http://hbase.apache.org/book/columnfamily.html) especially the part: "Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics."
Question : You have a "WebLog" table in HBase. The Row Keys are the IP Addresses. You want to retrieve all entries that have an IP Address of .... The shell command you would use is:
1. get 'WebLog', '75.67.21.146' 2. scan 'WebLog', '75.67.21.146' 3. (option text not available) 4. scan 'WebLog', {COLFAM => 'IP', FILTER => '75.67.12.146'} Ans : 1 Exp : HBase gives you two classes to read data: Get and Scan. The Get class reads data by specifying a single row key and the Scan class supports a range scan. In the HBase Shell, a get operation performs the action on a single row. To get everything for a row, simply execute a get operation with the row to get.
Further Reading: The HBase Shell wiki includes a section on scan, which includes: Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions. Examples:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
Question : For a given Column Family, you want to always retain at least one version, but expire all other versions that are older than 5 days. Which of the following Column Family attribute settings would you set to do this?
Exp : * Time To Live (TTL) ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.
5 days is 432,000 (5 x 24 x 60 x 60) seconds
* Minimum Number of Versions Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M is less than N). This parameter should only be set when time-to-live is enabled for a column family and must be less than the number of row versions.
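Taken together, a hedged sketch of the column family settings this question describes (the family name is hypothetical; setTimeToLive and setMinVersions are the HColumnDescriptor setters referenced above):

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setTimeToLive(5 * 24 * 60 * 60); // 432,000 seconds = 5 days
cf.setMinVersions(1);               // always retain at least one version, even past the TTL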
Question : Given that the following is your entire dataset:
How many regions will be read during a scan of the entire dataset?
1. 4 2. 2 3. 1 4. 3 Ans : 3 This table consists of one unique rowkey (100) and 2 different column families (Managers, Skills). All data for a given row in the table is managed together in a region. Region size is configurable between 256MB and 20GB. In this example, 4 rows can fit within the lowest default region size, 256MB; therefore there is one region for this dataset.
Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family. The hierarchy of objects is as follows:
Table (HBase table)
  Region (Regions for the table)
    Store (Store per ColumnFamily for each Region for the table)
      MemStore (MemStore for each Store for each Region for the table)
      StoreFile (StoreFiles for each Store for each Region for the table)
        Block (Blocks within a StoreFile within a Store for each Region for the table)
Region Size: Determining the "right" region size can be tricky, and there are a few factors to consider:
HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB of data on a 20-node cluster, your data will be concentrated on just a few machines - nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200MB of data into HBase and then wondering why your awesome 10-node cluster isn't doing anything.
On the other hand, high region count has been known to make things slow. This is getting better with each release of HBase, but it is probably better to have 700 regions than 3000 for the same amount of data.
There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc, held by the RegionServer.
When starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be largish (100k and up).
Question : You have created a schema called QuickTechie in your HBase, with the following tables: 1. USERPROFILES 2. USERARTICLES 3. (table name not available) 4. USERGROUPS 5. USERCOMMENT. How many regions will the RegionServers have? 1. It must be equal to five 2. Only one 3. (option text not available) 4. It would have at least five regions. Ans : 4

Exp : HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS, and it can handle petabytes of data. One of the interesting capabilities in HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large. The basic unit of horizontal scalability in HBase is called a Region. Regions are a subset of the table's data and they are essentially a contiguous, sorted range of rows that are stored together. Initially, there is only one region for a table. When regions become too large after adding more rows, the region is split into two at the middle key, creating two roughly equal halves.

In HBase the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served only by one Region Server. The HBase architecture has two main services: HMaster, which is responsible for coordinating the cluster and executing administrative operations, and HRegionServer, which is responsible for handling a subset of the table's data. HBase clusters expand by adding RegionServers that are hosted on industry-standard servers (often called commodity hardware). If a cluster expands with more RegionServers, it increases both in terms of storage and processing capacity. When you want to increase HBase write throughput, moving to a bigger cluster by adding RegionServers will improve write performance.

HMaster, Region Assignment, and Balancing: As previously mentioned, the HBase Master coordinates the HBase cluster and is responsible for administrative operations. A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup, and the master can decide to move a Region from one Region Server to another as the result of a load balance operation. The Master also handles Region Server failures by assigning the region to another Region Server.

The mapping of Regions and Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations the master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data. To put or get a row, clients don't have to contact the master; they can directly contact the Region Server that handles the specified row or, in the case of a client scan, the set of Region Servers responsible for handling the set of keys. To identify the Region Server, the client does a query on the META table. META is a system table used to keep track of regions. It contains the server name and a region identifier comprising a table name and the start row-key. By looking at the start-key and the next region's start-key, clients are able to identify the range of rows contained in a particular region. The client keeps a cache of the region locations. This avoids clients hitting the META table every time an operation on the same region is issued.
In case of a region split or move to another Region Server (due to balancing, or assignment policies), the client will receive an exception as a response and the cache will be refreshed by fetching the updated information from the META table. Since META is a table like the others, the client has to identify on which server META is located. The META locations are stored in a ZooKeeper node on assignment by the Master, and the client reads the node directly to get the address of the Region Server that contains META. HBase's original design was based on BigTable, with another table called -ROOT- containing the META locations and Apache ZooKeeper pointing to it. HBase 0.96 removed that arrangement in favor of ZooKeeper only, since META cannot be split and therefore consists of a single region.

Client API: Master and Regions Responsibilities: The HBase Java client API has two main interfaces. HBaseAdmin allows interaction with the "table schema" by creating/deleting/modifying tables, and it allows interaction with the cluster by assigning/unassigning regions, merging regions together, calling for a flush, and so on; this interface communicates with the Master. HTable allows the client to manipulate the data of a specified table by using get, put, delete, and all the other data operations; this interface communicates directly with the Region Servers responsible for handling the requested set of keys. Those two interfaces have separate responsibilities: HBaseAdmin is only used to execute admin operations and communicate with the Master, while HTable is used to manipulate data and communicate with the Regions.

Conclusion : As we've seen here, having a Master/Slave architecture does not mean that each operation goes through the master. To read and write data, the HBase client in fact goes directly to the specific Region Server responsible for handling the row keys for all the data operations (HTable). The Master is used by the client only for table creation, modification, and deletion operations (HBaseAdmin). Although the concept of a Master exists, the HBase client does not depend on it for data operations, and the cluster can keep serving data even if the master goes down. Each table has at least one region, until it splits. If you have five tables stored in HBase, you have at least five regions. You may have more depending on the size of the region and whether it has split, but you have at least five. HBase documentation on Region: Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family.
Question : HLog stores all the edits to the HStore; it is the HBase write-ahead log implementation. It performs logfile-rolling, so external callers are not aware that the underlying file is being rolled. There is one HLog per RegionServer, and all edits for all Regions carried by a particular RegionServer are entered first in the HLog. Select the order in which data is written to the HLog. 1. It can write in any order to give parallelization 2. One batch is first sorted in ascending order of row key to give read performance 3. (option text not available) 4. It maintains the same order as it is written. Ans : 4 Exp : HLog stores all the edits to the HStore. It is the HBase write-ahead log implementation. It performs logfile-rolling, so external callers are not aware that the underlying file is being rolled. There is one HLog per RegionServer. All edits for all Regions carried by a particular RegionServer are entered first in the HLog.
Each HRegion is identified by a unique long int. HRegions do not need to declare themselves before using the HLog; they simply include their HRegion-id in the append or completeCacheFlush calls. An HLog consists of multiple on-disk files, which have a chronological order. As data is flushed to other (better) on-disk structures, the log becomes obsolete. We can destroy all the log messages for a given HRegion-id up to the most-recent CACHEFLUSH message from that HRegion.
It's only practical to delete entire files. Thus, we delete an entire on-disk file F when all of the messages in F have a log-sequence-id that's older (smaller) than the most-recent CACHEFLUSH message for every HRegion that has a message in F. Synchronized methods can never execute in parallel. However, between the start of a cache flush and the completion point, appends are allowed but log rolling is not. To prevent log rolling taking place during this period, a separate reentrant lock is used.As all regions on a RegionServer share a single instance of HLog, data is written sequentially for the best I/O performance. It also means that there is overhead if you need to replay the log and restore data; however this happens infrequently enough to warrant sequential writes for the performance gains the cluster achieves. As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
Log splitting is done by HMaster as the cluster starts or by ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, affected regions are unavailable until data is restored. So we need to recover and replay all WAL edits before letting those regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.
Question : Your Acmeshell Android client application writes to HBase. Which sequence will your application traverse to find the Region serving the row range of interest? 1. ZooKeeper 2. .META. 3. RegionServer 4. Region. 1. 2,3,1,4 2. 4,2,3,1 3. 1,2,3,4 4. 1,2,4,3 Ans : 3

Exp : Catalog Tables: The catalog table hbase:meta exists as an HBase table and is filtered out of the HBase shell's list command, but is in fact a table just like any other.

-ROOT- : The -ROOT- table was removed in HBase 0.96.0; information here should be considered historical. The -ROOT- table kept track of the location of the .META. table (the previous name for the table now called hbase:meta) prior to HBase 0.96. The -ROOT- table structure was as follows: Key: .META. region key (.META.,,1). Values: info:regioninfo (serialized HRegionInfo instance of hbase:meta), info:server (server:port of the RegionServer holding hbase:meta), info:serverstartcode (start-time of the RegionServer process holding hbase:meta).

hbase:meta : The hbase:meta table (previously called .META.) keeps a list of all regions in the system. The location of hbase:meta was previously tracked within the -ROOT- table, but is now stored in ZooKeeper. The hbase:meta table structure is as follows: Key: region key of the format ([table],[region start key],[region id]). Values: info:regioninfo (serialized HRegionInfo instance for this region), info:server (server:port of the RegionServer containing this region), info:serverstartcode (start-time of the RegionServer process containing this region). When a table is in the process of splitting, two other columns will be created, called info:splitA and info:splitB. These columns represent the two daughter regions; the values for these columns are also serialized HRegionInfo instances. After the region has been split, this row will eventually be deleted.

Note on HRegionInfo: The empty key is used to denote table start and table end. A region with an empty start key is the first region in a table. If a region has both an empty start and an empty end key, it is the only region in the table. In the (hopefully unlikely) event that programmatic processing of catalog metadata is required, see the Writables utility.

Startup Sequencing: First, the location of hbase:meta is looked up in ZooKeeper. Next, hbase:meta is updated with server and startcode values. When you write to HBase for the first time, you need to find the region where your new row will be written. To locate the RegionServer, HBase keeps two catalog tables that most users will never have to touch, -ROOT- and .META.. .META. holds the location of the regions of all the tables, and -ROOT- holds the location of .META.. When a RegionServer crashes, you won't know where to find the region, so you look into a cached location for .META.; if you cannot find it in .META., you look into -ROOT-. This is the reason why HBase stores the location of -ROOT- in a znode in ZooKeeper. The correct order for pre-0.96 versions of HBase is ZooKeeper -> -ROOT- -> .META. -> RegionServer -> Region. The correct order for post-0.96 versions of HBase is ZooKeeper -> .META. -> RegionServer -> Region.
The general flow is that a new client contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from Zookeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for. Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.
Question : You have an Android-based client application. What information does the .META. table provide to the Acmeshell App? 1. It provides the pointer to -ROOT-
Ans : 4 Exp : The .META. table keeps a list of all regions in the system: the current location of each region for each table, and also its name, its HRegionInfo, and some server information. The catalog table details and the client lookup flow are the same as described in the explanation of the previous question.
Question : You are migrating data from an Oracle database to HBase with Put commands, and you observed that your migration is very slow. You contacted the HBase administrator and he suggested using the bulk load API instead of executing Put commands for bulk insert operations. Why? 1. There is a possibility of losing the data if one of the datanodes fails in Hadoop 2. The bulk load API will bypass the NameNode and directly write the data to the DataNode, later informing the NameNode via heartbeat 3. (option text not available) 4. Writes bypass the HLog and there is less load on the RegionServer
Ans : 4 Exp : HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API. Bulk Load Limitations : As bulk loading bypasses the write path, the WAL doesn't get written to as part of the process. Replication works by reading the WAL files so it won't see the bulk loaded data – and the same goes for the edits that use Put.setWriteToWAL(true). One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there. Bulk Load Architecture : The HBase bulk load process consists of two main steps. Preparing data via a MapReduce job : The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that they can be later loaded very efficiently into the cluster.
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table. HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table. Completing the data load : After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means. Importing the prepared data using the completebulkload tool: After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example: $ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase). Note: If the target table does not already exist in HBase, this tool will create the table automatically. This tool will run quickly, after which point the new data will be visible in the cluster. This question is specific to bulk loading with the HBase bulk load API, not using external tools. The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that it can be later loaded very efficiently into the cluster. In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region.
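A hedged two-step sketch of the bulk load path described above, assuming the pre-1.0 mapreduce API (the job object, table name "mytable", and output path are illustrative; imports from org.apache.hadoop.hbase.mapreduce are assumed):

// Step 1: in the MapReduce driver, write HFiles partitioned to match region boundaries
HTable table = new HTable(conf, "mytable");
HFileOutputFormat.configureIncrementalLoad(job, table); // wires in TotalOrderPartitioner + HFileOutputFormat

// Step 2: after the job completes, adopt the HFiles into the running cluster
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/user/todd/myoutput"), table);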
Question : You have an Acmeshell advertising application which connects to HBase and inserts user comments into the AcmeLogs table, and the table uses a random number as the rowkey. Later you learn that the random-number rowkey has no business use, and you decide to use the email id as the rowkey instead. Select the correct statement. 1. Using an HBase shell command you can change the rowkeys
2. Simply ALTER the TABLE and change the rowkeys 3. (option text not available) 4. You have decided to keep both tables until the complete migration happens, and you modify the client code so it now writes to both the old table and a new table while migrating the old data in the background. Ans : 4 Exp : In HBase, rowkeys are immutable. If you change the row key, you have to rewrite the entire table. Typically, you create a new table with new rowkeys and migrate the old data into the new table in the background via a map-reduce job or tools. Monotonically Increasing Row Keys/Timeseries Data
In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so that they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table. Try to minimize row and column sizes, or: why are my StoreFile indices large? In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (Section 9.7.6.4, "StoreFile (HFile)") to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc in the above-cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread "a question storefileIndexSize" on the user mailing list.
Most of the time small inefficiencies don't matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.
Question : You have a USERPROFILE table in HBase. The row keys are dates. You would like to retrieve all rows between 01012012 and 31122012 (1 year of data). Select the proper command for this operation. 1. scan 'USERPROFILE', {STARTROW => '01011990', STOPROW => '31122012'} 2. get 'USERPROFILE', {STARTROW => '01011990', STOPROW => '31122012'}
3. (option text not available) 4. scan 'USERPROFILE', {STARTROW => '01011990', LIMIT => '31122012'} Ans : 1 Exp : HBase gives you two classes to read data: Get and Scan. The Get class reads data by specifying a single row key and the Scan class supports a range scan. In the HBase Shell, you can type scan commands to retrieve a range of rows. All operations are identical to get commands with the exception of instantiation: rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the scan will iterate over all rows. Scan a table; pass the table name and optionally a dictionary of scanner specifications. Scanner specifications may include one or more of the following: LIMIT, STARTROW, STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will be scanned. To scan all members of a column family, leave the qualifier empty as in 'col_family:'. Examples:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

For experts, there is an additional option -- CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off (false). By default it is enabled. Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

One interesting piece of information not mentioned there is that version and column delete markers are ordered in line together with the KeyValues that they affect, and family delete markers are always sorted to the beginning of their row. Generally each column family is represented by a Store, which manages one or more StoreFiles. Scanning is a form of merge-sort performed by a RegionScanner, which merges results of one or more StoreScanners (one per family), who in turn merge results of one or more StoreFileScanners (one for each file for this family):

                        RegionScanner
                       /             \
          StoreScanner               StoreScanner
         /            \             /            \
StoreFileScanner StoreFileScanner StoreFileScanner StoreFileScanner
       |                |                |                |
   StoreFile        StoreFile        StoreFile        StoreFile

Say you performed the following actions (T is time):
put: row1, family, col1, value1, T
delete family: row1, family, T+1
put: row1, family, col1, value2, T+2
delete columns: row1, family, col1, T+3
put: row1, family, col1, value3, T+4

What we will find in the StoreFile for "family" is this: family-delete row1, T+1; row1,col1,value3, T+4; column-delete row1,col1, T+3; row1,col1,value2, T+2; row1,col1,value1, T. KeyValues are ordered in reverse chronological order (within their row and column). The family delete marker, however, is always first in the row. That makes sense, because the family delete marker potentially affects many columns in this row, so in order to allow scanners to scan forward-only, the family delete markers need to be seen by a scanner first. That also means that even if we are only looking for a specific column, we always seek to the beginning of the row to check if there is a family delete with a timestamp that is greater or equal to the versions of the column that we are interested in. After that the scanner seeks to the column. And even if we are looking for a past version of a column, we have to seek to the "beginning" of the column (i.e. the potential delete marker right before it) before we can scan forward to the version we're interested in. My initial patch for HBASE-5268 would sort the prefix delete markers just like column delete markers. By now it should be obvious why this does not work.
The beginning of a row is a known point, so it the "beginning" of a column. The beginning of a prefix of a column is not. So to find out whether a column is marked for deletion we would have to start at the beginning of the row and then scan forward to find all prefix delete markers. That clearly is not efficient. My 2nd attempt placed the all prefix delete markers at the beginning of the row. That technically works. But notice that a column delete marker only has to be retained by the scanner for a short period of time (until after we scanned past all versions that it affects). For prefix delete markers we'd have to keep them into memory until we scanned past all columns that start with the prefix. In addition the number of prefix delete markers for a row is not principally limited. Family delete markers do not have this problem because (1) the number of column families is limited for other reasons and (2) store files are per family, so all we have to remember for a family in a StoreScanner is a timestamp.
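For reference, the same range scan can be issued from the Java client API. This is a minimal sketch only; the table name and key values simply mirror the question above, and note that the stop row is exclusive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "USERPROFILE");
        // Inclusive start row, exclusive stop row -- the shell's STARTROW/STOPROW.
        Scan scan = new Scan(Bytes.toBytes("01012012"), Bytes.toBytes("31122012"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();  // scanners hold server-side resources; always close them
            table.close();
        }
    }
}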
Question : What will increasing the value of the hbase.hregion.max.filesize setting affect? 1. Automated region splitting
2. DataStore volume capacity will be increased 3. Access Mostly Uused Products by 50000+ Subscribers 4. Block size will be increased
Ans : 1
Exp : Regardless of whether pre-splitting is used or not, once a region gets to a certain limit, it is automatically split into two regions. If you are using HBase 0.94 (which comes with HDP-1.2), you can configure when HBase decides to split a region, and how it calculates the split points, via the pluggable RegionSplitPolicy API. There are a few predefined region split policies: ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, and KeyPrefixRegionSplitPolicy.

The first one is the default and only split policy for HBase versions before 0.94. It splits a region when the total data size for one of the stores (corresponding to a column family) in the region gets bigger than the configured hbase.hregion.max.filesize, which has a default value of 10GB. This split policy is ideal in cases where you have done pre-splitting and are interested in a lower number of regions per region server.

The default split policy for HBase 0.94 and trunk is IncreasingToUpperBoundRegionSplitPolicy, which does more aggressive splitting based on the number of regions hosted on the same region server. The split policy uses a max store file size of Min(R^2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table hosted on the same region server. So, for example, with the default memstore flush size of 128MB and the default max store size of 10GB, the first region on the region server will be split just after the first flush at 128MB. As the number of regions hosted on the region server increases, it will use increasing split sizes: 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, etc. After reaching 9 regions, the split size goes beyond the configured hbase.hregion.max.filesize, at which point the 10GB split size is used from then on. For both of these algorithms, regardless of when splitting occurs, the split point used is the rowkey that corresponds to the midpoint in the "block index" for the largest store file in the largest store.

KeyPrefixRegionSplitPolicy is a curious addition to the HBase arsenal. You can configure the length of the prefix by which your row keys are grouped, and this split policy ensures that regions are not split in the middle of a group of rows having the same prefix. If you have set prefixes for your keys, then you can use this split policy to ensure that rows having the same rowkey prefix always end up in the same region. This grouping of records is sometimes referred to as "Entity Groups" or "Row Groups", and it is a key feature when considering use of the "local transactions" feature in your application design. You can configure the default split policy by setting the configuration "hbase.regionserver.region.split.policy", or by configuring the table descriptor.

Bigger Regions : Consider going to larger regions to cut down on the total number of regions on your cluster. Generally, fewer regions to manage makes for a smoother-running cluster. (You can always manually split the big regions later should one prove hot and you want to spread the request load over the cluster.) A lower number of regions is preferred, generally in the range of 20 to low hundreds per RegionServer. Adjust the region size as appropriate to achieve this number. For the 0.90.x codebase, the upper bound of region size is about 4GB, with a default of 256MB. For the 0.92.x codebase, due to the HFile v2 change, much larger region sizes can be supported (e.g., 20GB). You may need to experiment with this setting based on your hardware configuration and application needs. Adjust hbase.hregion.max.filesize in your hbase-site.xml. Region size can also be set on a per-table basis via HTableDescriptor.

Managed Splitting : Rather than let HBase auto-split your regions, you can manage the splitting manually. With growing amounts of data, splits will continually be needed. Since you always know exactly what regions you have, long-term debugging and profiling are much easier with manual splits; it is hard to trace the logs to understand region-level problems if regions keep splitting and getting renamed. Data offlining bugs + unknown number of split regions == oh crap! If an HLog or StoreFile was mistakenly unprocessed by HBase due to a weird bug and you notice it a day or so later, you can be assured that the regions specified in these files are the same as the current regions, and you have fewer headaches trying to restore/replay your data. You can also finely tune your compaction algorithm: with roughly uniform data growth, it is easy to cause split/compaction storms as the regions all hit roughly the same data size at the same time; with manual splits, you can let staggered, time-based major compactions spread out your network IO load.

How do I turn off automatic splitting? Automatic splitting is determined by the configuration value hbase.hregion.max.filesize. It is not recommended that you set this to Long.MAX_VALUE, in case you forget about manual splits. A suggested setting is 100GB, which would result in > 1hr major compactions if reached.
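To make the IncreasingToUpperBoundRegionSplitPolicy arithmetic above concrete, here is a small sketch (using the default 128MB flush size and 10GB max file size from the explanation) that reproduces the 128MB, 512MB, 1152MB, ... progression:

public class SplitSizeProgression {
    public static void main(String[] args) {
        long flushSize = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size default
        long maxFileSize = 10L * 1024 * 1024 * 1024;  // hbase.hregion.max.filesize default
        for (int r = 1; r <= 10; r++) {
            // Min(R^2 * flush size, max file size), where R is the number of
            // regions of the table hosted on this region server.
            long splitSize = Math.min((long) r * r * flushSize, maxFileSize);
            System.out.printf("R=%d -> split at %d MB%n", r, splitSize / (1024 * 1024));
        }
    }
}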
Question : In a ZooKeeper ensemble, how many simultaneous ZooKeeper node failures can be handled? 1. 1 2. 2 3. Access Mostly Uused Products by 50000+ Subscribers 4. 4
Ans : 3
Exp : For a reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble is up, the service will be available. Because ZooKeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.

A distributed Apache HBase installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you: it starts and stops the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independently of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.

When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native zoo.cfg file, or, the easier option, specify ZooKeeper options directly in conf/hbase-site.xml. A ZooKeeper configuration option can be set as a property in the hbase-site.xml configuration file by prefacing the ZooKeeper option name with hbase.zookeeper.property. For example, the clientPort setting in ZooKeeper can be changed by setting the hbase.zookeeper.property.clientPort property. For all default values used by HBase, including ZooKeeper configuration, see HBase Default Configuration and look for the hbase.zookeeper.property prefix. You must at least list the ensemble servers in hbase-site.xml using the hbase.zookeeper.quorum property. This property defaults to a single ensemble member at localhost, which is not suitable for a fully distributed HBase (it binds to the local machine only, and remote clients will not be able to connect).

How many ZooKeepers should I run? You can run a ZooKeeper ensemble that comprises only one node, but in production it is recommended that you run an ensemble of 3, 5, or 7 machines; the more members an ensemble has, the more tolerant the ensemble is of host failures. Further, you should run an odd number of machines. An even number of peers is supported, but it is normally not used, because an even-sized ensemble requires proportionally more peers to form a quorum than an odd-sized ensemble does. For example, an ensemble of 4 peers requires 3 to form a quorum, while an ensemble of 5 also requires only 3. Thus, an ensemble of 5 allows 2 peers to fail and is more fault tolerant than an ensemble of 4, which allows only 1 down peer. In order to form a proper quorum you need at least 3 servers; a ZooKeeper ensemble of 5 allows 2 peers to fail.

It is critical that you run ZooKeeper under supervision, since ZooKeeper is fail-fast and will exit the process if it encounters any error case. The ZooKeeper data directory contains files which are a persistent copy of the znodes stored by a particular serving ensemble: its snapshot files. As changes are made to the znodes, these changes are appended to a transaction log; occasionally, when a log grows large, a snapshot of the current state of all znodes is written to the filesystem. This snapshot supersedes all previous logs. ZooKeeper's transaction log should be on a dedicated device (a dedicated partition is not enough): ZooKeeper writes the log sequentially, without seeking, and sharing the log device with other processes can cause seeks and contention, which in turn can cause multi-second delays. Also, do not put ZooKeeper in a situation where it can swap; for ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to it.
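As a minimal illustration of the properties described above, an externally managed three-node ensemble might be declared in conf/hbase-site.xml as follows (the host names zk1/zk2/zk3 are placeholders, and 2181 is ZooKeeper's usual client port):

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <!-- ZooKeeper's clientPort, set via the hbase.zookeeper.property prefix -->
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>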
Question : When do you get an error with the message "Too many open files" (java.io.IOException) while writing to HBase? 1. When available memory is very low and HBase is not able to open a data file. 2. When there is not enough space on one of the datanodes 3. Access Mostly Uused Products by 50000+ Subscribers 4. The system does not allow more than 10 files for read and write in case of HBase
Ans : 3
Exp : Limits on the number of files and processes (ulimit): Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X). You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase. See Section 15.9.2.2, java.io.IOException... (Too many open files) for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

It is recommended to raise the ulimit to at least 10,000, but more likely 10,240, because the value is usually expressed in multiples of 1024. Each ColumnFamily has at least one StoreFile, and possibly more than six StoreFiles if the region is under load. The number of open files required depends upon the number of ColumnFamilies and the number of regions. The following is a rough formula for calculating the potential number of open files on a RegionServer:

(StoreFiles per ColumnFamily) x (ColumnFamilies per region) x (regions per RegionServer)

For example, assuming that a schema has 3 ColumnFamilies per region with an average of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration files, and others. Opening a file does not take many resources, and the risk of allowing a user to open too many files is minimal.

Another related setting is the number of processes a user is allowed to run at once. In Linux and Unix, the number of processes is set using the ulimit -u command. This should not be confused with the nproc command, which controls the number of CPUs available to a given user. Under load, an nproc that is too low can cause OutOfMemoryError exceptions. See Jack Levin's "major hdfs issues" thread on the hbase-users mailing list from 2011.

Configuring the maximum number of file descriptors and processes for the user who is running the HBase process is an operating system configuration, rather than an HBase configuration. It is also important to be sure that the settings are changed for the user that actually runs HBase. To see which user started HBase, and that user's ulimit configuration, look at the first line of the HBase log for that instance.

ulimit settings on Ubuntu: To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf, which is a space-delimited file with four columns. Refer to the man page for limits.conf for details about the format of this file. In the following example, the first line sets both soft and hard limits for the number of open files (nofile) to 32768 for the operating system user with the username hadoop. The second line sets the number of processes to 32000 for the same user.

hadoop - nofile 32768
hadoop - nproc 32000

The settings are only applied if the Pluggable Authentication Module (PAM) environment is directed to use them. To configure PAM to use these limits, be sure that the /etc/pam.d/common-session file contains the following line:

session required pam_limits.so

This exception states that the system can't create a new file because there are too many open files. You can increase the maximum number of open files in the system with the ulimit command.
Question : Which action will help to improve random read performance? 1. Increase the blocksize
2. Decrease the blocksize 3. Access Mostly Uused Products by 50000+ Subscribers 4. Increase the number of partitions in the table
Ans : 2
Exp : Decreasing the blocksize will improve random reads of small cell values (e.g., around 100 bytes). ColumnFamilies can optionally be defined as in-memory; data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the block cache, but this is not a guarantee that the entire table will be in memory. ColumnFamily BlockSize : The blocksize can be configured for each ColumnFamily in a table, and it defaults to 64KB. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled, the resulting indexes should be roughly halved). Blocks are used for different things in HDFS and HBase: blocks in HDFS are the unit of storage on disk, while blocks in HBase are a unit of storage for memory. Many HBase blocks fit into a single HBase file. HBase is designed to maximize efficiency on top of the HDFS file system and fully uses the block size there; some people have even tuned their HDFS to 20GB block sizes to make HBase more efficient. If you have perfectly random access on a table that is much larger than memory, then the HBase cache will not help you. However, since HBase is intelligent in how it stores and retrieves data, it does not need to read an entire file block from HDFS to get at the data needed for a request: data is indexed by key and is efficient to retrieve. Additionally, if you have designed your keys well to distribute data across your cluster, random reads will read equally from every server, so overall throughput is maximized.
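For illustration, the per-ColumnFamily blocksize can be set when creating a table through the Java admin API. This is only a sketch; the table and family names are made up, and 16KB is just an example value below the 64KB default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class BlocksizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("USERPROFILE");
        HColumnDescriptor family = new HColumnDescriptor("profile");
        family.setBlocksize(16 * 1024);  // smaller blocks favor random reads of small cells
        family.setInMemory(true);        // give this family block-cache priority
        desc.addFamily(family);

        admin.createTable(desc);
        admin.close();
    }
}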
Question :
Please select the correct statement(s) for HBase
1. HBase is a distributed database 2. HBase is a Hadoop application 3. Access Mostly Uused Products by 50000+ Subscribers 4. HBase should be used when you require real-time read or write random access to data 5. 1, 2 and 4 are correct Ans : 5 Exp : HBase is a distributed, column-oriented database built on top of HDFS. It is a Hadoop application, which can be used when the requirement is real-time read/write random access to very large datasets.
Question : HBase is not a relational database but supports SQL
1. True 2. False Ans : 2
Exp : HBase is not a relational database, and it does not support SQL.
Question :
In HBase data cells, how is versioning implemented?
1. HBase creates a unique cell ID based on some numeric and character data 2. Versioning is done based on a hash key of the cell value 3. Access Mostly Uused Products by 50000+ Subscribers 4. There is no versioning support in HBase
Ans : 3 Exp : HBase versions cell values; versioning is based on the current timestamp, which is auto-assigned by HBase at the time of cell insertion.
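To make timestamp-based versioning concrete, here is a minimal sketch with the Java client API (the table, family, and qualifier names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersioningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");

        // HBase stamps the new cell with the current time unless a timestamp is supplied.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("salary"), Bytes.toBytes("5500"));
        table.put(put);

        // A read can ask for several timestamped versions of the same cell.
        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);
        Result result = table.get(get);
        System.out.println(result);
        table.close();
    }
}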
Question :
In HBase, a cell's content is an uninterpreted array of bytes
1. True 2. False
Ans : 1
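Exp : Because HBase stores cell contents as plain byte arrays, serialization and deserialization are the client's job. A minimal sketch with the client library's Bytes utility:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesExample {
    public static void main(String[] args) {
        // HBase does not interpret cell contents; clients convert to and from bytes.
        byte[] raw = Bytes.toBytes(42L);        // serialize a long
        long back = Bytes.toLong(raw);          // deserialize it again
        System.out.println(back);               // 42

        byte[] text = Bytes.toBytes("hello");   // strings work the same way
        System.out.println(Bytes.toString(text));
    }
}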
Question :
Table rows are sorted by the table's primary key
1. True 2. False Ans : 1
Exp : Table rows are sorted by row key, the table's primary key, and the sorting is byte-ordered. All table access is via the table's primary key.
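One practical consequence of byte ordering: row keys sort lexicographically, not numerically, so "row10" sorts before "row2". A tiny sketch (plain string sorting mirrors HBase's byte order for ASCII keys):

import java.util.Arrays;

public class ByteOrderExample {
    public static void main(String[] args) {
        String[] keys = {"row1", "row10", "row2"};
        Arrays.sort(keys);  // lexicographic, like HBase's byte ordering
        System.out.println(Arrays.toString(keys));  // [row1, row10, row2]
        // Zero-padding the numeric part (row01, row02, row10) restores numeric order.
    }
}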
Question : You will use the MapR-DB JSON-based document database model for _________________ data formats. A. hierarchical B. nested C. varying D. evolving 1. A,B,C 2. B,C,D 3. Access Mostly Uused Products by 50000+ Subscribers 4. A,B,C,D 5. A,C,E