Question : You have a file-based data source to which data is continuously added. You now need to import this data into Hadoop (HDFS). Which of the following tools helps you implement this? 1. Sqoop
Correct Answer : Explanation: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
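As an illustration, the kind of Flume agent this scenario describes could be configured roughly as follows; this is a minimal sketch, and the agent name, spool directory, and HDFS path are placeholder assumptions rather than values from the question. The spooling-directory source picks up new files as they are added, and the HDFS sink writes them into the cluster.

    # agent1.conf -- illustrative Flume agent (all names and paths are placeholders)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = snk1

    # Watch a local directory; files dropped here are ingested as they appear
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/log/incoming
    agent1.sources.src1.channels = ch1

    # Simple in-memory channel between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Deliver the events into HDFS as plain data files
    agent1.sinks.snk1.type = hdfs
    agent1.sinks.snk1.hdfs.path = hdfs://namenode:8020/data/incoming
    agent1.sinks.snk1.hdfs.fileType = DataStream
    agent1.sinks.snk1.channel = ch1

Such an agent would then be started with something like: flume-ng agent --conf conf --conf-file agent1.conf --name agent1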
Question : Which of the following features are supported by IBM GPFS? 1. There is a requirement where logical isolation and physical isolation need to be supported
2. There is a requirement for separate clusters for analytics and databases
4. There is a need to run the NameNode in a separate high availability environment
Correct Answer : Explanation: GPFS supports local disks on cluster nodes and storage area networks (SANs). Logical isolation and physical isolation are both supported: filesets can be separate file systems inside of a file system (logical isolation), or can be placed in separate storage pools (physical isolation). InfoSphere BigInsights uses a customized version of GPFS that supports all existing GPFS commands and provides additional interfaces and commands.

GPFS supports thousands of nodes and petabytes of storage, so you can scale to meet your most demanding needs. Data is replicated on multiple nodes so that no single point of failure exists, whereas the NameNode is a single point of failure in HDFS. You can push updates asynchronously or synchronously, allowing you to choose how you want to manage changes from a primary system to a secondary system. If a node fails, changes are replicated to other nodes. When the failed node is operational again, GPFS quickly determines which blocks must be recovered; changes that occurred while the node was down are copied to the previously failed node so that it is synchronized with the other nodes in the cluster.

Applications define their own logical block size by segmenting data into file blocks, with each file block determined by the effective block size or chunk size. Applications can also determine the replication layout by using wide striping over the network, write affinity on a local disk, or a combination of both. Allowing applications to dictate block size and replication layout provides greater performance and efficiency than HDFS.
Question : You are creating a Hadoop-based solution and need to plan the archival (storage) size of the cluster. Which of the following do you need to consider when deciding the cluster's archival size? 1. Replication factor
Correct Answer : Explanation: If the minimum storage capacity of your cluster is not at least double the size of your data, you will end up with under-replicated blocks. Under-replicated blocks are those replicated fewer times than the replication factor; for example, if your replication factor is 2 and capacity is short, some blocks will end up with a replication factor of 1.
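As a rough illustration of why the replication factor drives cluster sizing, the back-of-the-envelope calculation below multiplies the logical data size by the replication factor and adds headroom for non-DFS use; the numbers and the 25% reserve are assumptions for the example, not figures from the question.

    // Hedged sketch: a simple capacity estimate, not an official Hadoop sizing tool.
    public class ClusterSizing {
        public static void main(String[] args) {
            double dataTb = 100.0;        // logical data to be stored, in TB (example value)
            int replicationFactor = 3;    // HDFS default replication factor
            double nonDfsReserve = 0.25;  // assume ~25% of raw capacity reserved for temp/non-DFS use

            // Raw capacity needed = data size * replication factor / usable fraction
            double rawTb = dataTb * replicationFactor / (1.0 - nonDfsReserve);
            System.out.printf("Raw cluster capacity needed: %.0f TB%n", rawTb); // prints 400 TB
        }
    }

With a replication factor of 3, 100 TB of data already needs 300 TB of raw storage before any working space is set aside.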
Replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, but it also helps run your map tasks close to the data, avoiding extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.
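For completeness, the replication factor of existing files can also be changed through the HDFS FileSystem API; the sketch below is illustrative, the file path is a placeholder, and the default for newly written files comes from dfs.replication in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hedged sketch: request a single replica for one (illustrative) file in HDFS.
    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            boolean requested = fs.setReplication(new Path("/data/archive/file.csv"), (short) 1);
            System.out.println("Replication change requested: " + requested);
            fs.close();
        }
    }

The same change can be made from the command line with hdfs dfs -setrep.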