
IBM Certified Data Architect - Big Data Certification Questions and Answers (Dumps and Practice Questions)



Question : What is HBase?
1. HBase is a separate set of Java APIs for the Hadoop cluster
2. HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using the Hadoop infrastructure
3. (option omitted)
4. HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data processing.


Correct Answer : 2

Explanation: HBase is one of the Hadoop framework projects that allows real-time data scans across big data volumes. It is very often used to serve data from a cluster.
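The scan behaviour described above can be illustrated with a toy model in plain Python (an illustration of the concept only, not the real HBase API): HBase keeps rows sorted by row key, so a scan is an ordered walk over a key range.

```python
from bisect import bisect_left

# Toy model of an HBase-style table: rows stored sorted by row key,
# read back with a start/stop key-range scan.
class ToyTable:
    def __init__(self, rows):
        self.keys = sorted(rows)   # HBase keeps row keys in sorted order
        self.rows = rows

    def scan(self, start_key, stop_key):
        """Yield (key, value) pairs with start_key <= key < stop_key."""
        i = bisect_left(self.keys, start_key)
        while i < len(self.keys) and self.keys[i] < stop_key:
            yield self.keys[i], self.rows[self.keys[i]]
            i += 1

table = ToyTable({"row1": "a", "row2": "b", "row3": "c", "row9": "z"})
print(list(table.scan("row2", "row9")))  # [('row2', 'b'), ('row3', 'c')]
```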






Question :What is the role of the namenode?
1. Namenode splits big files into smaller blocks and sends them to different datanodes
2. Namenode is responsible for assigning names to each slave node so that they can be identified by the clients
3. (option omitted)
4. Both 2 and 3 are valid answers




Correct Answer : (not shown)


Explanation: The namenode is the "brain" of the Hadoop cluster, responsible for managing the distribution of blocks across the system based on the replication policy. The namenode
also supplies the specific block addresses in response to client requests.
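This bookkeeping can be sketched with simplified, hypothetical data structures (not actual Hadoop code): the namenode holds only metadata, namely which blocks make up a file and which datanodes hold each block.

```python
# Toy model of namenode metadata: the namenode never stores block data,
# only the file -> blocks and block -> datanodes mappings.
file_to_blocks = {"/user/data.txt": ["blk_1", "blk_2"]}
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path):
    """Answer a client request: block IDs plus the datanodes holding them."""
    return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]

print(locate("/user/data.txt"))
```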







Question : What happens if a datanode loses network connection for a few minutes?
1. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back
online, the administrator will need to manually delete the extra replicas
2. All data will be lost on that node. The administrator has to ensure proper data distribution between nodes
3. (option omitted)
4. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra
replicas will be deleted

Ans : 4
Exp : The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node.
The moment a datanode becomes unavailable, the namenode triggers replication of the data from the existing replicas. If the datanode later comes back up, the over-replicated data
will be deleted. Note: the data might be deleted from the original datanode.
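The behaviour in the explanation can be sketched with toy data structures (an illustration only, not Hadoop's implementation): losing a node triggers re-replication, and a returning node makes its blocks over-replicated, so extra copies are deleted.

```python
# Toy sketch of the namenode maintaining the replication factor.
REPLICATION = 3
live = {"dn1", "dn2", "dn3", "dn4"}          # datanodes currently reachable
replicas = {"blk_1": {"dn1", "dn2", "dn3"}}  # block -> nodes holding a copy

def node_lost(node):
    """A datanode stopped responding: re-replicate its blocks elsewhere."""
    live.discard(node)
    for holders in replicas.values():
        holders.discard(node)
        while len(holders) < REPLICATION and live - holders:
            holders.add(min(live - holders))  # copy from a remaining replica

def node_returned(node, held_blocks):
    """The datanode came back: its copies are now extras and get deleted."""
    live.add(node)
    for blk in held_blocks:
        replicas[blk].add(node)
        while len(replicas[blk]) > REPLICATION:
            replicas[blk].remove(node)  # here the returning node's copy is dropped

node_lost("dn1")                 # blk_1 re-replicated onto dn4
node_returned("dn1", ["blk_1"])  # extra replica removed again
print(len(replicas["blk_1"]))    # exactly REPLICATION copies remain
```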






Question : What happens if one of the datanodes has a much slower CPU? How will it affect the performance of the cluster?
1. The task execution will be as fast as the slowest worker.
However, if speculative execution is enabled, the slowest worker will not have such a big impact
2. The slowest worker will significantly impact job execution time. It will slow everything down
3. (option omitted)
4. It depends on the level of priority assigned to the task. All high-priority tasks are executed in parallel twice, so a slower datanode would be
bypassed. If the task is not high priority, however, performance will be affected.
Ans : 1
Exp : Hadoop was specifically designed to work with commodity hardware. Speculative execution helps to offset slow workers: multiple instances of the same task are created, the
job tracker takes the first result into consideration, and the remaining instance of the task is killed
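Speculative execution as described, launching a duplicate of a slow task and keeping whichever result arrives first, can be sketched in plain Python with concurrent.futures (an analogy only, not Hadoop code):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import time

# Toy task: simulates a worker that takes `delay` seconds.
def task(delay, label):
    time.sleep(delay)
    return label

with ThreadPoolExecutor(max_workers=2) as pool:
    slow = pool.submit(task, 0.5, "slow worker")
    speculative = pool.submit(task, 0.05, "speculative copy")
    # Wait for whichever attempt finishes first ...
    done, pending = wait([slow, speculative], return_when=FIRST_COMPLETED)
    winner = done.pop().result()
    # ... and kill the losing attempt, like the job tracker does.
    for f in pending:
        f.cancel()

print(winner)  # the speculative copy finishes first
```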


Question :

If you have a file of 128 MB and the replication factor is set to 3, how many blocks can you find on the cluster corresponding to
that file (assuming the default Apache Hadoop configuration)?
1. 3
2. 6
3. (option omitted)
4. 12
Ans : 2
Exp : Based on the configuration settings, the file will be divided into multiple blocks according to the default block size of 64 MB: 128 MB / 64 MB = 2 blocks. Each block will
be replicated according to the replication factor setting (default 3): 2 * 3 = 6.
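The arithmetic in the explanation can be checked directly:

```python
import math

# 128 MB file, the older Apache Hadoop default block size of 64 MB,
# and the default replication factor of 3.
file_size_mb = 128
block_size_mb = 64
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)  # 128 / 64 = 2 blocks
total = blocks * replication                      # 2 * 3 = 6 block replicas
print(blocks, total)  # 2 6
```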



Question : What is the replication factor?
1. Replication factor controls how many times the namenode replicates its metadata
2. Replication factor creates multiple copies of the same file to be served to clients
3. (option omitted)
4. None of these answers are correct.
Ans : 3
Exp : Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.



Question :

How does the Hadoop cluster tolerate datanode failures?


1. Failures are anticipated. When they occur, the jobs are re-executed.
2. Datanodes talk to each other and figure out what needs to be re-replicated if one of the nodes goes down
3. (option omitted)
4. Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively
maintains the replication factor on all data.
Ans : 4
Exp : The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and
starts replication of the data the moment a disconnect is detected.




Question :

Which of the following tools defines a SQL-like language?


1. Pig
2. Hive
3. (option omitted)
4. Flume
Ans : 2



Question : As a client of HadoopExam, you are able to access the Hadoop cluster of HadoopExam Inc. Once your application validates
its identity and is granted access to a file in the cluster, what is the remainder of the read path back to the client?
1. The NameNode gives the client the block IDs and a list of DataNodes on which those blocks are found, and the application reads the blocks directly from the DataNodes.
2. The NameNode maps the read request against the block locations in its stored metadata, and reads those blocks from the DataNodes. The client application then reads
the blocks from the NameNode.
3. (option omitted)
4. The NameNode directs the client to the DataNode closest to the client according to Hadoop's rack topology. The client application then reads the blocks from that single DataNode.



Correct Answer : 1


Explanation: When a client wishes to read a file from HDFS, it contacts the NameNode and requests the locations and names of the first few blocks in the file. It then directly contacts
the DataNodes containing those blocks to read the data. It would be very wasteful to move blocks around the cluster based on a client's read request, so that is never done.
Similarly, if all data was passed via the NameNode, the NameNode would immediately become a serious bottleneck and would slow down the cluster operation dramatically.

First, let's walk through the logic of performing an HDFS read operation. For this, we'll assume there's a file /user/esammer/foo.txt already in HDFS. In addition to using Hadoop's
client library (usually a Java JAR file), each client must also have a copy of the cluster configuration data that specifies the location of the namenode. The client begins by
contacting the namenode, indicating which file it would like to read. The client identity is first validated, either by trusting the client and allowing it to specify a username or
by using a strong authentication mechanism such as Kerberos, and then checked against the owner and permissions of the file. If the file exists and the user has access to it, the
namenode responds to the client with the first block ID and the list of datanodes on which a copy of the block can be found, sorted by their distance to the client. Distance to the
client is measured according to Hadoop's rack topology: configuration data that indicates
which hosts are located in which racks. If the namenode is unavailable for some reason (because of a problem with either the namenode itself or the network, for example), clients
will receive timeouts or exceptions (as appropriate) and will be unable to proceed. With the block IDs and datanode hostnames, the client can now contact the most appropriate
datanode directly and read the block data it needs. This process repeats until all blocks in the file have been read or the client closes the file stream.

It is also possible that, while reading from a datanode, the process, or the host on which it runs, dies. Rather than give up, the library will automatically attempt to read
another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. Another corner case that can
occur is that the information returned by the namenode about block locations can be outdated by the time the client attempts to contact a datanode, in which case either a retry
will occur if there are other replicas, or the read will fail. While rare, it is these kinds of corner cases that make troubleshooting a large distributed system such as Hadoop
so complex.
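The read path described above can be condensed into a mock sketch (hypothetical objects standing in for the namenode and datanodes, not the real Hadoop client library), including the fallback to another replica when a datanode is down:

```python
# Mock namenode metadata: path -> [(block_id, replicas sorted by distance)].
namenode = {
    "/user/esammer/foo.txt": [
        ("blk_1", ["dn-rack1-a", "dn-rack2-b"]),  # nearest replica first
        ("blk_2", ["dn-rack1-c", "dn-rack2-d"]),
    ],
}
# Mock datanode block storage; dn-rack1-a is down in this scenario.
datanodes = {"dn-rack2-b": {"blk_1": b"foo"}, "dn-rack1-c": {"blk_2": b"bar"}}

def read_file(path):
    data = b""
    for block_id, replicas in namenode[path]:  # step 1: ask the namenode
        for dn in replicas:                    # step 2: try nearest datanode first
            store = datanodes.get(dn)
            if store and block_id in store:
                data += store[block_id]
                break                          # got the block, move to the next
        else:
            raise IOError("all replicas unavailable for " + block_id)
    return data

print(read_file("/user/esammer/foo.txt"))  # b'foobar'
```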



Related Questions


Question : In the Hadoop . framework, if HBase is also running on the same node for which the available RAM is  GB, what is the ideal configuration
for "Reserved System Memory"?
2. 2GB
3. 3GB
4. No need to reserve


Question : MapReduce runs on top of YARN and utilizes YARN Containers to schedule and execute its Map and Reduce tasks.
When configuring MapReduce resource utilization on YARN, which of the following aspects should be considered?
1. The physical RAM limit for each Map and Reduce task
2. The JVM heap size limit for each task.
3. The amount of virtual memory each task will receive.
4. 1 and 3
5. All 1,2 and 3
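For reference, the three aspects listed above map to real Hadoop configuration properties (set in mapred-site.xml and yarn-site.xml); the values below are illustrative only and must be tuned per cluster:

```xml
<!-- Illustrative values only; tune for your cluster. -->
<property>
  <!-- aspect 1: physical RAM limit of the container for each map task -->
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <!-- aspect 2: JVM heap size, kept below the container's physical limit -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1228m</value>
</property>
<property>
  <!-- aspect 3: virtual memory allowed per MB of physical memory -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```

Analogous `mapreduce.reduce.memory.mb` and `mapreduce.reduce.java.opts` properties exist for reduce tasks.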



Question : Assuming you're not running HDFS Federation, what is the maximum number of NameNode daemons you
should run on your cluster in order to avoid a split-brain scenario with your NameNode when running HDFS
High Availability (HA) using Quorum-based storage?


1. Two active NameNodes and two Standby NameNodes
2. One active NameNode and one Standby NameNode
3. Two active NameNodes and one Standby NameNode
4. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy



Question : When running with N JournalNodes, the system can tolerate at most _____ failures and continue to function normally.
1. N/2
2. (N - 1) / 2
3. (N + 1) / 2
4. (N - 2) / 2


Question : Table schemas in Hive are:
1. Stored as metadata on the NameNode
2. Stored along with the data in HDFS
3. Stored in the Metadata
4. Stored in ZooKeeper
5. Stored in Hive Metastore


Question : __________ are responsible for local monitoring of resource availability, fault reporting,
and container life-cycle management (e.g., starting and killing
jobs).


1. NodeManagers
2. Application Manager
3. Application Master
4. Resource Manager