Question : What is HBase?
1. HBase is a separate set of Java APIs for a Hadoop cluster
2. HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using the Hadoop infrastructure
3. [option missing]
4. HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data processing
Explanation: HBase is one of the Hadoop framework projects. It allows real-time data scans across large data volumes and is very often used to serve data from a cluster.
Question : What is the role of the namenode?
1. The namenode splits big files into smaller blocks and sends them to different datanodes
2. The namenode is responsible for assigning names to each slave node so that they can be identified by the clients
3. [option missing]
4. Both 2 and 3 are valid answers
Explanation: The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks across the system based on the replication policy. The namenode also supplies the specific addresses of the data in response to client requests.
Question : What happens on the namenode when a client tries to read a data file?
1. The namenode will look up the information about the file in the edit file and then retrieve the remaining information from the filesystem memory snapshot
2. The namenode is not involved in retrieving the data file since the data is stored on datanodes
3. [option missing]
4. None of these answers are correct
Ans : 1 Exp : Since the namenode needs to support a large number of clients, the primary namenode only sends back information about the data location. The datanode itself is responsible for the retrieval.
Question : What mode(s) can Hadoop code be run in?
1. Hadoop can be deployed in distributed mode only
2. Hadoop can be deployed in stand-alone mode or distributed mode
3. [option missing]
4. None of these are applicable modes for Hadoop
Ans : 3 Exp : Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, and as a single process for testing purposes.
Question : What is the role of the secondary namenode?
1. The secondary namenode is a backup namenode that will serve requests if the primary namenode goes down
2. The secondary namenode performs the CPU-intensive operation of combining edit logs and current filesystem snapshots
3. [option missing]
4. There is no secondary namenode
Ans : 2 Exp : The secondary namenode runs as a separate process because combining the edit log with the filesystem snapshot is CPU-intensive and because of the additional requirements for metadata backup.
Question : What are the side effects of not running a secondary namenode?
1. The cluster performance will degrade over time since the edit log will grow bigger and bigger
2. The primary namenode will become overloaded and response time will be slower
3. [option missing]
4. The only possible impact is that when there is an outage, a failover to the secondary namenode will not occur. This is a rare occurrence
Ans : 1 Exp : If the secondary namenode is not running at all, the edit log will grow significantly and slow the system down. Also, on restart the system will stay in safemode for an extended time, since the namenode needs to combine the edit log with the current filesystem checkpoint image.
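The checkpoint step described in the explanation above can be sketched as a toy model. This is only an illustration, not Hadoop code: the function name `checkpoint`, the set-of-paths image, and the `("create", path)` / `("delete", path)` edit format are all assumptions made for the example.

```python
def checkpoint(fsimage, edit_log):
    """Merge logged edits into a copy of the image.

    fsimage:  set of file paths (hypothetical stand-in for the filesystem image)
    edit_log: list of (operation, path) tuples recorded since the last checkpoint
    Returns the merged image and a fresh, empty edit log.
    """
    merged = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
    return merged, []


fsimage = {"/a", "/b"}
edit_log = [("create", "/c"), ("delete", "/a")]

# After the checkpoint the log is empty again, so it cannot grow without bound.
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage), edit_log)
```

Without a periodic merge of this kind, the whole log has to be replayed in one go at startup, which is the extended safemode period the explanation mentions.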
Question : What happens if a datanode loses its network connection for a few minutes?
1. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the administrator will need to manually delete the extra replicas
2. All data will be lost on that node. The administrator has to ensure proper data distribution between nodes
3. [option missing]
4. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted
Ans : 4 Exp : The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. Once the namenode considers a datanode unavailable, it triggers replication of the data from the existing replicas. However, if the datanode comes back up, over-replicated data will be deleted. Note: the data might be deleted from the original datanode.
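A toy model of the behaviour described above, assuming a single block and hypothetical datanode names `dn1` to `dn4`. The function `rebalance` and its selection rule (alphabetical order) are invented for the illustration; the real namenode uses its own placement and deletion policies.

```python
def rebalance(block_locations, live_nodes, replication=3):
    """Return a block -> set-of-nodes map with `replication` live replicas where possible."""
    result = {}
    for block, nodes in block_locations.items():
        # Replicas on nodes that are currently reachable.
        alive = sorted(n for n in nodes if n in live_nodes)
        # Under-replicated: copy the block to live nodes that do not hold it yet.
        candidates = sorted(n for n in live_nodes if n not in nodes)
        while len(alive) < replication and candidates:
            alive.append(candidates.pop(0))
        # Over-replicated: keep only `replication` replicas and drop the rest.
        result[block] = set(alive[:replication])
    return result


locations = {"b1": {"dn1", "dn2", "dn3"}}

# dn3 drops off the network: the block is copied to dn4.
after_loss = rebalance(locations, live_nodes={"dn1", "dn2", "dn4"})

# dn3 comes back and still holds b1: four replicas exist, one is removed.
reported = {"b1": after_loss["b1"] | {"dn3"}}
after_return = rebalance(reported, live_nodes={"dn1", "dn2", "dn3", "dn4"})
print(sorted(after_loss["b1"]), len(after_return["b1"]))
```

In this sketch the replica that gets dropped is simply the last one alphabetically, which matches the note that the deleted copy is not necessarily the newly created one.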
Question : What happens if one of the datanodes has a much slower CPU? How will it affect the performance of the cluster?
1. The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact
2. The slowest worker will significantly impact job execution time. It will slow everything down
3. [option missing]
4. It depends on the level of priority assigned to the task. All high-priority tasks are executed in parallel twice. A slower datanode would therefore be bypassed. If the task is not high priority, however, performance will be affected
Ans : 1 Exp : Hadoop was specifically designed to work with commodity hardware. Speculative execution helps to offset slow workers: multiple instances of the same task are created, the job tracker takes the first result into consideration, and the second instance of the task is killed.
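The effect of speculative execution on job time can be sketched with a simplified model. The function `job_time` and the durations are made up for the example, and the model assumes every attempt of a task starts at the same moment, which a real scheduler does not guarantee.

```python
def job_time(task_attempts):
    """Job completion time under a simplified model.

    task_attempts: one list of attempt durations (seconds) per task.
    A task finishes with its fastest attempt (the others are killed);
    the job finishes when its slowest task finishes.
    """
    return max(min(attempts) for attempts in task_attempts)


# Three tasks, the third one running on a slow node.
without_speculation = job_time([[10], [12], [60]])

# Same job, but the slow task gets a second attempt on a faster node.
with_speculation = job_time([[10], [12], [60, 15]])

print(without_speculation, with_speculation)  # 60 15
```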
Question : If you have a file 128 MB in size and the replication factor is set to 3, how many blocks can you find on the cluster that correspond to that file (assuming the default Apache and Cloudera configuration)?
1. 3
2. 6
3. [option missing]
4. 12
Ans : 2 Exp : Based on the configuration settings, the file is divided into multiple blocks according to the default block size of 64 MB: 128 MB / 64 MB = 2. Each block is replicated according to the replication factor setting (default 3): 2 * 3 = 6.
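The arithmetic in the explanation above can be checked with a short sketch. The function name `total_block_replicas` is hypothetical; the defaults of 64 MB and 3 are the ones the explanation assumes.

```python
import math


def total_block_replicas(file_size_mb, block_size_mb=64, replication=3):
    """Number of block replicas stored across the cluster for one file."""
    # A partially filled last block still counts as a block, hence the ceiling.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks * replication


print(total_block_replicas(128))  # 2 blocks * 3 replicas = 6
```

With a different configured block size the result changes: for example, a 128 MB block size gives 1 block and therefore 3 replicas for the same file.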
Question : What is the replication factor?
1. The replication factor controls how many times the namenode replicates its metadata
2. The replication factor creates multiple copies of the same file to be served to clients
3. [option missing]
4. None of these answers are correct
Ans : 3 Exp : Data is replicated in the Hadoop cluster based on the replication factor. A higher replication factor improves data availability in the event of failure.
Question : What daemons run on master nodes?
1. NameNode, DataNode, JobTracker and TaskTracker
2. NameNode, DataNode and JobTracker
3. [option missing]
4. NameNode, Secondary NameNode, JobTracker, TaskTracker and DataNode
Ans : 3 Exp : Hadoop comprises five separate daemons, and each daemon runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes. DataNode and TaskTracker run on each slave node.
Question : What is the role of the jobtracker in a Hadoop cluster?
1. The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, and retrying failed tasks
2. The jobtracker is responsible for job monitoring on the cluster and sending alerts to master nodes
3. [option missing]
4. Both 1 and 3 are valid answers
Ans : 1 Exp : The jobtracker is the main component of map-reduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs and reports results back to the calling code.
Question : How does the Hadoop cluster tolerate datanode failures?
1. Failures are anticipated. When they occur, the jobs are re-executed
2. Datanodes talk to each other and figure out what needs to be re-replicated if one of the nodes goes down
3. [option missing]
4. Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data
Ans : 4 Exp : The namenode actively tracks the status of all datanodes and acts once datanodes become non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data when a disconnect is detected.
Question : In the Reducer, the MapReduce API provides an iterator over Writable values. Calling the next() method:
1. Returns a reference to a different Writable object each time
2. Returns a reference to a Writable object from an object pool
3. [option missing]
4. Returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise
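The question above turns on whether the iterator hands out fresh objects or reuses one. As background, Hadoop's reducer iterator is generally documented as reusing a single Writable instance and overwriting its contents on each call to next(). The sketch below imitates that behaviour in plain Python; the `Text` class and `reused_values` generator are stand-ins written for this example, not the Hadoop classes.

```python
class Text:
    """Minimal mutable holder, standing in for a Writable (hypothetical)."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


def reused_values(raw_values):
    """Yield the same holder object every time, refilled with the next value."""
    holder = Text()
    for raw in raw_values:
        holder.set(raw)
        yield holder


# Keeping references: every entry points at the one shared object.
kept_refs = [v for v in reused_values(["a", "b", "c"])]
print([t.value for t in kept_refs])  # ['c', 'c', 'c']

# Copying the contents while iterating preserves each value.
kept_copies = [v.value for v in reused_values(["a", "b", "c"])]
print(kept_copies)  # ['a', 'b', 'c']
```

Under this model, code that stores the values from the iterator needs to copy them first, otherwise every stored reference ends up showing the last value.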
1. Flume is a distributed service
2. It is used to move large amounts of data as it is produced
3. It is ideal for gathering logs from multiple systems
4. It can be used to insert logs into HDFS
5. All of the above
1. Flume can continue to deliver events in the face of system component failure
2. Flume scales horizontally
3. Flume provides a central Master controller for manageability
4. 1 and 3
5. None of the above
1. The Master holds configuration information for each node
2. Nodes communicate with the Master every 5 seconds
3. Each node passes its version number to the Master
4. 1 and 3 are correct
5. 1, 2 and 3