Question : When the number of column families is decided during schema design, which statement correctly applies? 1. Flushing and compaction are done on a per-region basis 2. The more column families, the greater the I/O load 3. Access Mostly Uused Products by 50000+ Subscribers 4. All 1, 2 and 3 are correct 5. Only 1 and 2 are correct
When deciding the number of column families:
- Recommend no more than three column families (see the sketch after this list).
- Flushing and compaction are done on a per-region basis: if one CF is large enough to flush, the other CFs in that region are flushed as well.
- Compaction is triggered by the number of files per CF, not by size.
- The more column families, the greater the I/O load.
- Column families are defined by access scope: data across CFs is typically not accessed simultaneously.
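For illustration, here is a minimal sketch of creating a table with a small number of column families using the HBase 2.x Java client. The table name ARTICLES and the family names content and meta are example choices for this sketch, not anything mandated by the question.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateArticlesTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Keep the number of column families small (no more than three):
            // flushing and compaction happen per region, so every extra family
            // adds I/O even when only one family receives most of the writes.
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("ARTICLES"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("content"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("meta"));
            admin.createTable(table.build());
        }
    }
}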
Question : There is a feature on the QuickTechie.com website that lets any software professional create an article as well as update and delete it. You decided to use HBase rather than HDFS to store these articles, and you created an ARTICLES table in HBase to store all versions of the articles. After one year you find that there is no use in keeping versions older than 10 (by version number), so you fire a query to delete the older versions. Immediately afterwards you ran a scan query on the same table, and the next day you ran the same scan query again; you found a huge improvement in the scan operation on the next day. What could be the reason?
Correct Answer : Explanation: What is data locality and compaction? The term data locality refers to putting the data close to where it is needed. To have data locality, your cluster must co-locate the Hadoop DataNodes and the HBase Region Servers. The Hadoop DataNode stores the data that the Region Server is managing. The data migrates to the Region Server's node when the Region is compacted and re-written, and once the data of the Region is local, it stays local for as long as that Region Server serves that Region.

When something is written to HBase, it is first written to an in-memory store, called the MemStore. When the MemStore reaches a certain size, it is flushed to disk into a StoreFile. The store files that are created on disk are immutable. Sometimes the store files are merged together, which is done by a process called compaction. There are two kinds of compaction: major and minor. Minor compactions merge a small number of files. Major compaction merges all of the files in a region. Major compactions also remove deletes and expired versions.

Why tune the compaction? By default, major compactions run every 24 hours and merge together all store files into one. After a major compaction runs, there is a single StoreFile for each store. Compactions can cause HBase to block writes to prevent JVM heap exhaustion. Major compactions can also be triggered manually, which is the recommended procedure; you can schedule major compactions to occur when the usage on the cluster is low. After a compaction, if a new store file is greater than a certain size (based on the property hbase.hregion.max.filesize), the region is split into two new regions.

MapReduce tasks run close to the data that they process. This data locality is possible because large files in the distributed file system (DFS) are broken into smaller blocks. Each block maps to a task that is run to process the contained data. Larger block sizes mean that there are fewer map tasks to run, because the number of mappers is driven by the number of blocks that need processing. Hadoop knows where blocks are located and runs the map tasks directly on the node that hosts them. For HBase, the mappers run on the nodes of the region they are scanning, so data locality of the region with its data is important.

HBase has two types of compaction: major compaction and minor compaction. A minor compaction merges (rewrites) the last two or more small files into one (set to 3 by default). A major compaction picks up all the files in the region, merges them and performs cleanup operations. During these cleanup operations, major compaction removes deleted key/values, tombstone markers, and duplicate key/values. A major compaction rewrites all HFiles from the same column family into a single HFile per region, which results in fewer HFiles. Scan performance improves when you have fewer (and overall smaller) files to read; the result is that full table scans run faster immediately following a major compaction. In this scenario, the delete query only wrote tombstone markers; the next major compaction physically removed the old versions, which is why the scan ran much faster the following day.
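As a rough illustration of the manual trigger mentioned above, the following sketch uses the HBase 2.x Admin API to request a major compaction of the ARTICLES table; the connection setup is the stock one and error handling is omitted.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactArticles {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Request a major compaction: all store files per store are merged
            // and deleted/expired cells are physically removed. The request is
            // asynchronous; the region servers perform the actual work.
            admin.majorCompact(TableName.valueOf("ARTICLES"));
        }
    }
}

The same effect can be had by waiting for the automatic major compaction (every 24 hours by default), which is what happens in the question's scenario.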
Question : All the articles stored in the ARTICLES table use a RowKey that holds the article_id, and the following article ids were created using a random number generator. Now you want to find out whether article number 7654321 exists or not, and a Bloom filter is enabled to give better performance on the lookup. What would be returned by the Bloom filter? Row Keys : 5672343 , 0987543 , 2345895 , 1473578, 7654321, 907654
Correct Answer : Explanation: A Bloom filter is a kind of membership test -- it tells you whether an element is definitely not, or probably is, in a set of elements. It is quick and memory-efficient. The trade-off is that it is probabilistic, meaning that it tells you the data is certainly not in the set or may be in the set. If your Bloom filter returns true, it indicates that a key may be contained in a table. If the Bloom filter returns false, it confirms that a key is definitely not contained in the table. Enabling Bloom filters may save disk seeks and improve read latency.

Bloom filters can be enabled per column family. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per column family. The default is NONE, i.e. no Bloom filters. With ROW, the hash of the row is added to the bloom on each insert. With ROWCOL, the hash of row + column family + column qualifier is added to the bloom on each key insert.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives. In this question, row key 7654321 is actually present in the table, and because a Bloom filter never produces false negatives, it would return the "possibly in set" answer, so the read proceeds to the store files.
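The explanation above names the older HColumnDescriptor.setBloomFilterType API; as a hedged sketch, the equivalent in the HBase 2.x Java client uses ColumnFamilyDescriptorBuilder. The family name content is an assumption for the example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableRowBloom {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // ROW blooms hash only the row key, which suits a "does article_id
            // 7654321 exist" style lookup; ROWCOL also hashes family + qualifier.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("content"))
                .setBloomFilterType(BloomType.ROW)
                .build();
            admin.modifyColumnFamily(TableName.valueOf("ARTICLES"), cf);
        }
    }
}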
Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
1. When data volume is huge, e.g. TB to PB 2. When high throughput is needed, e.g. 1000s of queries per second 3. When there is a need for more cache 4. When data is sparse 5. All of the above
1. When you only append data to your dataset and read the whole dataset 2. When you need random reads 3. When you need random writes 4. When the access pattern is well known
1. In HBase every row has a Row Key 2. All columns in HBase belong to a particular column family 3. A table can have one or more column families 4. Table cells are versioned 5. All of the above
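A minimal sketch tying these data-model points together with the HBase 2.x Java client; the table ARTICLES, the family content and the qualifier body are assumptions made for the example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelDemo {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("ARTICLES"))) {
            // Every row is addressed by a row key, and every column belongs
            // to a column family (here: family "content", qualifier "body").
            byte[] rowKey = Bytes.toBytes("7654321");
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("body"),
                          Bytes.toBytes("first draft"));
            table.put(put);

            // Cells are versioned: ask for up to 10 versions of the same cell.
            // How many are actually retained depends on the family's VERSIONS setting.
            Get get = new Get(rowKey);
            get.readVersions(10);
            Result result = table.get(get);
            System.out.println("Versions found: "
                + result.getColumnCells(Bytes.toBytes("content"),
                                        Bytes.toBytes("body")).size());
        }
    }
}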