Question : You are working with an advertising company called Acmeshell. You have collected more than . million logos and images of your clients, which are stored in HBase, and you have a web application from which you retrieve these images. In which format will your data be returned from an HBase scan?
Explanation: HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes. There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the data model, and that includes versioning. Take that into consideration when making your design, as well as the block size for the ColumnFamily. Counters: One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in HTable. Synchronization on counters is done on the RegionServer, not in the client.
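To make the "bytes-in/bytes-out" point concrete, here is a minimal Java sketch of a scan; the table name "logos" and the column family/qualifier "img:data" are assumptions for illustration. Every value a scan returns arrives as a raw byte array inside a Result, and it is the application that decides how to interpret those bytes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LogoScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "logos");                    // hypothetical table name
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("img"), Bytes.toBytes("data")); // hypothetical family/qualifier
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // Every value comes back as a raw byte array; here we treat
                // it as the stored image bytes.
                byte[] imageBytes = result.getValue(Bytes.toBytes("img"), Bytes.toBytes("data"));
                System.out.println(Bytes.toString(result.getRow()) + " -> " + imageBytes.length + " bytes");
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}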
Question : You have a software professional website called QuickTechie.com where every day users create new articles. You extract all these articles from a MySQL database to a file called 12012014Articles.txt. In the Hadoop shell you fire the following command. Select the correct statement which applies.
hadoop fs -put 12012014Articles.txt /12012014
1. Copies the article txt file 12012014Articles.txt from the default HDFS directory into the HDFS directory /hdfs/hive/warehouse/12012014
2. Copies the article txt file 12012014Articles.txt from the default HDFS directory into the HDFS directory /hdfs
3. Copies the article txt file 12012014Articles.txt from the default HDFS directory into the HDFS directory /hdfs/12012014
4. Copies the article txt file 12012014Articles.txt from the local directory into the HDFS directory /12012014
Correct Answer : 4
Explanation: Hadoop shell command for put:
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.
hadoop fs -put localfile /user/hadoop/hadoopfile
In the question's command, the local source file is 12012014Articles.txt and the destination HDFS directory is /12012014.
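The same copy can also be done programmatically through the Hadoop FileSystem API; below is a minimal Java sketch equivalent to the shell command, using the paths from the question (error handling omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutArticles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // the default (HDFS) filesystem
        // Equivalent of: hadoop fs -put 12012014Articles.txt /12012014
        fs.copyFromLocalFile(new Path("12012014Articles.txt"), new Path("/12012014"));
        fs.close();
    }
}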
Question : You have downloaded HBase from the Apache distribution and did not change any HDFS settings. Now you have created a setup in which, as soon as a new article is committed by a Software Engineer on the website called QuickTechie.com, it is pushed to HBase. While saving the article in an HBase table, you observed that HBase first writes to the Write-Ahead Log (WAL). What could be the reason?
1. It will cache the data so it can give high read throughput
2. It will cache the data so it can give high write throughput
3. If the RegionServer fails before persisting the data to its final location, the data will still be available, avoiding any data loss.
4. It helps the even distribution of data across all the data centers.
Correct Answer : 3
Explanation: HBase internally puts your data in indexed StoreFiles that exist on HDFS for fast random read/write access. HBase simply stores files in HDFS. It does so for the actual data files (HFile) as well as its log (WAL). Each HBase RegionServer runs on a DataNode and performs data operations (e.g., compact, split, store) into HDFS via the WAL, MemStore and HFile. If a RegionServer fails, data that was stored in that server's MemStore can be recovered from its WAL. This is how HBase achieves durability. Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore for the affected Store. This ensures that HBase has durable writes. Without the WAL, there is the possibility of data loss in the case of a RegionServer failure before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.

The Write Ahead Log (WAL) records all changes to data in HBase, to file-based storage. Under normal operations, the WAL is not needed because data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed. If writing to the WAL fails, the entire operation to modify the data fails. HLog records all the edits to the HStore; it is the HBase write-ahead log (WAL).

A RegionServer serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file includes information about which region it belongs to. When a region is opened, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.

HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS, and it can handle petabytes of data. One of the interesting capabilities in HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large. The basic unit of horizontal scalability in HBase is called a Region. Regions are a subset of the table's data and they are essentially a contiguous, sorted range of rows that are stored together. Initially, there is only one region for a table. When a region becomes too large after adding more rows, it is split into two at the middle key, creating two roughly equal halves. In HBase the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e., a range of rows) can be served by only one Region Server.
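The write path described above is visible in the client API: by default every Put is appended to the RegionServer's WAL before it reaches the MemStore, and the Durability setting is the knob that controls this. A minimal sketch, assuming a hypothetical "articles" table and row key (the Durability enum is available in the HBase 0.96+ client):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaveArticle {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "articles");            // hypothetical table name
        Put put = new Put(Bytes.toBytes("article-12012014"));   // hypothetical row key
        put.add(Bytes.toBytes("content"), Bytes.toBytes("body"),
                Bytes.toBytes("article text..."));
        // Default behavior made explicit: the edit is written to the WAL
        // before the MemStore, so a RegionServer crash cannot lose it.
        put.setDurability(Durability.SYNC_WAL);
        // Durability.SKIP_WAL would trade this safety for write speed.
        table.put(put);
        table.close();
    }
}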
The HBase architecture has two main services: the HMaster, which is responsible for coordinating the cluster and executing administrative operations, and the HRegionServer, which is responsible for handling a subset of the table's data. HMaster, Region Assignment, and Balancing: As previously mentioned, the HBase Master coordinates the HBase cluster and is responsible for administrative operations. A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup, and the Master can decide to move a Region from one Region Server to another as the result of a load-balance operation. The Master also handles Region Server failures by assigning the Region to another Region Server. The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which Region is responsible for your key. This means that for read and write operations the Master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data.
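A short Java sketch of the read path just described: the client library looks up META behind the scenes and talks straight to the owning Region Server, never to the Master (the table name and row key below are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "articles");       // hypothetical table name
        // The client consults META (and caches the answer) to find which
        // RegionServer owns this row key, then sends the Get directly to
        // that server; the HMaster never sees the request.
        Get get = new Get(Bytes.toBytes("article-12012014"));
        Result result = table.get(get);
        System.out.println("Row exists: " + !result.isEmpty());
        table.close();
    }
}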