Question : You have network servers producing timeseries data from network traffic logs. You want to attain high write throughput for storing this data in an HBase table. Which of these should you choose for a row key to maximize your write throughput? 1. (hashCode(centralServerGeneratedSequenceID)>(timestamp> 2. (timestamp>
Explanation: Randomization can be used to ensure the input records are not in sorted order. Using an MD5 hash code, you can make a random distribution of the key across all available region servers.
In this case, the rowkey needs to be designed for high write throughput. If every rowkey starts with the same server information, all writes go to the same region, which is what you want to avoid. It makes the most sense to use randomized keys that distribute writes across different regions: this avoids creating region hot spots. You can design the rowkey with hash-coded server information followed by a timestamp.
More Information: HBase documentation on schema design, case study section 6.11.1.4, "Variable Length or Fixed Length Rowkeys?" It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is "a" and the event type is "e1" then the resulting rowkey would be quite small. However, what if the ingested hostname is "myserver1.mycompany.com" and the event type is "com.package1.subpackage2.subsubpackage3.ImportantService"?
It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:
Composite Rowkey With Hashes:
[MD5 hash of hostname] = 16 bytes
[MD5 hash of event-type] = 16 bytes
[timestamp] = 8 bytes
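The composite layout above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `composite_rowkey` and the choice of a big-endian signed 64-bit timestamp in milliseconds are not taken from the HBase documentation, and a real client would build the bytes with its own utilities.

```python
import hashlib
import struct

def composite_rowkey(hostname: str, event_type: str, timestamp_ms: int) -> bytes:
    """Build a fixed-length rowkey: MD5(hostname) + MD5(event type) + timestamp."""
    host_part = hashlib.md5(hostname.encode("utf-8")).digest()     # always 16 bytes
    event_part = hashlib.md5(event_type.encode("utf-8")).digest()  # always 16 bytes
    # Assumption: 8-byte big-endian encoding, so that for non-negative
    # timestamps the byte order matches the numeric order.
    ts_part = struct.pack(">q", timestamp_ms)
    return host_part + event_part + ts_part

short_key = composite_rowkey("a", "e1", 1356998400000)
long_key = composite_rowkey(
    "myserver1.mycompany.com",
    "com.package1.subpackage2.subsubpackage3.ImportantService",
    1356998400000,
)
print(len(short_key), len(long_key))  # prints "40 40"
```

Whatever the length of the hostname or event type, the key stays at 16 + 16 + 8 = 40 bytes, which is the point of the hashed substitution.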
Question : Suppose you have several tables in an RDBMS that are frequently joined to fetch data, and you now want to migrate these tables to HBase. Select the correct statement below. 1. Create all the tables, each with multiple column families, in HBase 2. Create a single table with as many column families as tables 3. [...] for all the tables 4. Any of the above will be fine
Explanation: each row is basically a linked list, ordered by column family and then column name. This is how it's laid down on disk, as well. Missing columns are free, because there is no space on disk pre-allocated to a null column. Given that, it's reasonable to design a schema where rows have hundreds or thousands of columns.
Just as columns are laid down on disk like a linked list, so too are rows. They are put on disk in order by row key. Because row keys can be any collection of bytes, ordering of row keys is lexicographical, aka alphabetical. This is in contrast to most RDBMSs, where row keys are typically integers and ordered as such.
Consider the following row key order: 1 - 256 - 43 - 7. The row key 256 is actually before 43, because 2 comes before 4. This is why it's common in HBase to make at least parts of your row key fixed width, e.g. 00000001 - 00000007 - 00000043 - 00000256. However, now we have another problem, known as hot spotting.
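The ordering described above is easy to reproduce by sorting byte strings. A minimal sketch, assuming ASCII-encoded keys and a hypothetical fixed width of 8 digits:

```python
keys = ["1", "256", "43", "7"]

# Byte strings compare byte by byte, like HBase row keys.
lexicographic = sorted(k.encode("ascii") for k in keys)

# Left-padding with zeros to a fixed width restores the numeric order.
padded = sorted(k.zfill(8).encode("ascii") for k in keys)

print(lexicographic)  # [b'1', b'256', b'43', b'7']
print(padded)         # [b'00000001', b'00000007', b'00000043', b'00000256']
```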
If all your row keys start with the same value, then they will all be going to the same region, and thus the same server. This could easily happen for monotonically increasing row keys, such as traditional RDBMS auto-incrementing PKs, or for timestamps. This can cause all the load for a big write job to block waiting for a single region server, versus spreading out the writes to the whole cluster. A common way to avoid this is to prefix row keys, for example by the md5 hash of the customer ID.
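The effect of a hash prefix can be illustrated with a toy model. Everything about the cluster here is an assumption made for the sketch: four regions pre-split evenly on the first hex character of the key, a 4-character MD5 prefix, and the helper names `region_for`, `sequential_key` and `hashed_key`. Real region boundaries depend on how the table is split.

```python
import hashlib
from collections import Counter

NUM_REGIONS = 4  # hypothetical: regions pre-split evenly on the first hex character

def region_for(rowkey: str) -> int:
    """Toy region lookup: map the key's first hex character (0-f) to a region index."""
    return int(rowkey[0], 16) * NUM_REGIONS // 16

def sequential_key(customer_id: int) -> str:
    """Zero-padded, monotonically increasing key, like an auto-increment PK."""
    return f"{customer_id:08d}"

def hashed_key(customer_id: int) -> str:
    """Same key, prefixed with the first 4 hex characters of its MD5 hash."""
    raw = sequential_key(customer_id)
    prefix = hashlib.md5(raw.encode("ascii")).hexdigest()[:4]
    return f"{prefix}-{raw}"

ids = range(1, 1001)
sequential_load = Counter(region_for(sequential_key(i)) for i in ids)
hashed_load = Counter(region_for(hashed_key(i)) for i in ids)

print("sequential:", dict(sequential_load))  # every key starts with "0": one region
print("hashed:    ", dict(hashed_load))      # writes spread over several regions
```

The trade-off is that hashed keys are no longer stored in customer-ID order, so range scans over consecutive IDs are lost.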
Rows can most efficiently be read back by scanning consecutive blocks. Say you have a table with a rowkey of customer-date-user. You can easily read back all the data for a given customer and date range using the prefix customer-first-part-of-date, but you can't easily read back all dates for a single user without scanning all of that customer's rows. If you reorder the row key as customer-user-date, you have the reverse problem. So you want to think about what your primary read pattern is going to be when designing your keys.
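A sorted in-memory list is enough to sketch why the prefix matters. The sample rows and the helper `prefix_scan` are made up for illustration; the helper assumes the last byte of the prefix is below 0xff so that it can be incremented to form the exclusive stop key.

```python
import bisect

# Hypothetical rows keyed as customer-date-user, kept sorted like an HBase table.
rows = sorted([
    b"acme-20130101-alice",
    b"acme-20130101-bob",
    b"acme-20130102-alice",
    b"acme-20130215-carol",
    b"zeta-20130101-dave",
])

def prefix_scan(sorted_rows, prefix):
    """Return the contiguous block of rows that start with `prefix`."""
    stop = prefix[:-1] + bytes([prefix[-1] + 1])  # smallest key after the prefix range
    lo = bisect.bisect_left(sorted_rows, prefix)
    hi = bisect.bisect_left(sorted_rows, stop)
    return sorted_rows[lo:hi]

# Customer + date range: one contiguous slice.
acme_january = prefix_scan(rows, b"acme-201301")

# One user across all dates: no usable prefix, so every acme row is examined.
acme_alice = [r for r in prefix_scan(rows, b"acme-") if r.endswith(b"-alice")]

print(acme_january)
print(acme_alice)
```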
Say your primary read pattern is going to be reading off the most recent rows. Depending on the format of the dates in your row keys, you may end up with the more recent data at the end of the table. For example: 20130101 > 20130102 > 20130103. Instead, a common pattern is to invert your dates, such as 79869898 > 79869897 > 79869896. This may not apply if you will know at run time the range of values that will be the most recent, e.g. the last 30 days.
Access patterns are an important factor in HBase schema design. Even though the two tables in this scenario have very different data sizes and formats, it is better to store them in one table if you are accessing them together most of the time.
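The date inversion mentioned above amounts to subtracting the date from a fixed maximum. A minimal sketch, assuming 8-digit YYYYMMDD dates and a hypothetical constant `MAX_DATE` of 99999999:

```python
MAX_DATE = 99999999  # assumption: largest 8-digit value, used as the inversion base

def inverted_date(yyyymmdd: int) -> str:
    """Return the date as a fixed-width key in which newer dates sort first."""
    return str(MAX_DATE - yyyymmdd).zfill(8)

dates = [20130101, 20130102, 20130103]
keys = sorted(inverted_date(d) for d in dates)

print(keys)  # ['79869896', '79869897', '79869898']
```

The first key in sorted order, 79869896, corresponds to the most recent date, 20130103, so a scan from the start of the table returns the newest rows first.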
Column families allow for separation of data. You can store different types and formats of data in different column families. Attributes such as compression, Bloom filters, and replication are set on a per-column-family basis. In this example, it is better to store product information and product images in two different column families within one table.
"Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics."
Question : You have a table with the following rowkeys based on the date: 21012010, 22012010, 23012010, 21052010, 28012010, 24012010, 29012010. In which order will these rows be retrieved from a scan? 1. 21012010, 22012010, 23012010, 21052010, 28012010, 24012010, 29012010 2. 21012010, 21052010, 22012010, 23012010, 24012010, 28012010, 29012010 3. [...] 4. It could be in any random order
Explanation: HBase table contents are sorted in the following order: rowkey, column family, column qualifier, and timestamp. Rowkeys are kept in lexicographical order. In this example, we need to compare only the rowkeys. They are sorted first by the first character, "2", then by the second character, giving "21", "22", and "23", then by the third character, the fourth character, and so on: 21012010, 21052010, 22012010, 23012010, 24012010, 28012010, 29012010
All HBase data model operations return data in sorted order: first by row, then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted in reverse, so newest records are returned first).
The row key is treated by HBase as an array of bytes but it must have a string representation. A special property of the row key Map is that it keeps them in a lexicographical order. For example, numbers going from 1 to 100 will be ordered like this: 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,91,92,93,94,95,96,97,98,99
To keep the integers' natural ordering, the row keys have to be left-padded with zeros. To take advantage of this, the functionalities of the row key Map are augmented by offering a scanner which takes a start row key (if not specified, the first one in the table) and a stop row key (if not specified, the last one in the table). For example, if the row keys are dates in the format YYYYMMDD, getting the month of July 2008 is a matter of opening a scanner from 20080700 to 20080800.
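The ordering and the scanner range can be checked with a small simulation. This is a sketch over plain Python strings, not HBase client code: the helper `scan` and the sample table are hypothetical, and string comparison is used because it matches byte order for ASCII digits.

```python
import bisect

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Rowkeys from the question, in the order a scan would return them.
question_keys = sorted(["21012010", "22012010", "23012010", "21052010",
                        "28012010", "24012010", "29012010"])
print(question_keys)

# Unpadded numbers sort as text; zero-padded ones keep their natural order.
numbers = sorted(str(n) for n in range(1, 101))
padded = sorted(f"{n:03d}" for n in range(1, 101))
print(numbers[:4])  # ['1', '10', '100', '11']
print(padded[:3])   # ['001', '002', '003']

# YYYYMMDD keys: the start and stop keys do not need to exist in the table.
table = sorted(["20080630", "20080701", "20080715", "20080731", "20080801"])
print(scan(table, "20080700", "20080800"))  # ['20080701', '20080715', '20080731']
```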
It does not matter whether the specified row keys exist; the only thing to keep in mind is that the stop row key will not be returned, which is why the first of August is given to the scanner.
Column Families : A column family groups data of the same nature in HBase and has no constraint on the type. The families are part of the table schema and stay the same for each row; what differs from row to row is that the column keys can be very sparse. For example, row "20080702" may have in its "info:" family the following column keys: info:aaa, info:bbb, info:ccc, while row "20080703" only has: info:12342. Developers have to be very careful when using column keys, since a key with a length of zero is permitted, which means that in the previous example data can be inserted in column key "info:". We strongly suggest using empty column keys only when no other keys will be specified. Also, since the data in a family has the same nature, many attributes can be specified regarding performance and timestamps.
Timestamps : The values in HBase may have multiple versions, kept according to the family configuration. By default, HBase sets the timestamp of each new value to the current time in milliseconds and returns the latest version when a cell is retrieved. Developers can also provide their own timestamps when inserting data, and can specify a particular timestamp when fetching it.
1. scan throughput increases and random access latency decreases. 2. scan throughput decreases and random access latency increases. 3. [...] 4. scan throughput increases and random access latency increases