
IBM Certified Data Architect - Big Data Certification Questions and Answers (Dumps and Practice Questions)



Question : In the YARN design, Map-Reduce is just one ________.
1. Resource Manager
2. Application
3. Container
4. None of the above

Correct Answer : 2

Exp: The ApplicationMaster is the process that coordinates an application's execution in the cluster. Each application has its own unique ApplicationMaster,
which is tasked with negotiating resources (containers) from the ResourceManager and working with the NodeManager(s) to execute and monitor the
tasks. In the YARN design, Map-Reduce is just one application framework; this design permits building and deploying distributed applications using other
frameworks. For example, YARN ships with a Distributed-Shell application that allows a shell script to be run on multiple nodes on the YARN cluster.

Once the ApplicationMaster is started (as a container), it will periodically send heartbeats to the ResourceManager to affirm its health and to update the
record of its resource demands. After building a model of its requirements, the ApplicationMaster encodes its preferences and constraints in a heartbeat
message to the ResourceManager. In response to subsequent heartbeats, the ApplicationMaster will receive a lease on containers bound to an allocation
of resources at a particular node in the cluster. Depending on the containers it receives from the ResourceManager, the ApplicationMaster may update its
execution plan to accommodate the excess or lack of resources. Container allocation/deallocation can take place in a dynamic fashion as the application
progresses.
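For illustration, here is a minimal sketch of that heartbeat/allocation loop using Hadoop's AMRMClient API. The 1 GB / 1 vcore request, the polling interval, and the loop structure are illustrative assumptions, not details from the text; a real ApplicationMaster would also launch the granted containers via NMClient.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AmHeartbeatSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new Configuration());
    rm.start();
    // Register this ApplicationMaster with the ResourceManager.
    rm.registerApplicationMaster("", 0, "");
    // Encode a resource preference: 1 GB / 1 vcore, anywhere in the cluster.
    Resource capability = Resource.newInstance(1024, 1);
    rm.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));
    // Each allocate() call doubles as a heartbeat; the response may carry container leases.
    List<Container> granted = java.util.Collections.emptyList();
    while (granted.isEmpty()) {
      AllocateResponse resp = rm.allocate(0.0f);
      granted = resp.getAllocatedContainers();
      Thread.sleep(1000);
    }
    // A real AM would now launch tasks in the granted containers via NMClient.
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}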




Question : Select the correct statement for HDFS in Hadoop 2.
1. NameNode federation significantly improves the scalability and performance of HDFS by introducing the ability to deploy multiple NameNodes for a single cluster.
2. HDFS introduces built-in high availability for the NameNode via a new feature called the Quorum Journal Manager (QJM); QJM-based HA features an active NameNode and a standby NameNode
3. The standby NameNode can become active either by a manual process or automatically
4. 1 and 3
5. 1,2 and 3

Correct Answer : 5
Exp: Hadoop 2 offers significant improvements beyond YARN, namely improvements in HDFS (the Hadoop Distributed File System) that can influence infrastructure
decisions. Whether to use NameNode federation and NameNode HA (high availability) are the two important decisions that must be made by most
organizations. NameNode federation significantly improves the scalability and performance of HDFS by introducing the ability to deploy multiple
NameNodes for a single cluster. In addition, HDFS introduces built-in high availability for the NameNode via a new feature called the Quorum Journal
Manager (QJM). QJM-based HA features an active NameNode and a standby NameNode. The standby NameNode can become active either by a manual
process or automatically. Automatic failover works in coordination with ZooKeeper. Hadoop 2 HDFS introduces the ZKFailoverController, which uses
ZooKeeper's election functionality to determine the active NameNode.
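As a rough sketch of what such a deployment looks like, the standard hdfs-site.xml properties are shown below being set programmatically; all hostnames, ports, and the "mycluster" nameservice ID are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");                // active + standby
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020");
    // Quorum Journal Manager: the shared edit log lives on three JournalNodes.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
    // Automatic failover: the ZKFailoverController uses ZooKeeper elections.
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
  }
}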








Question : Select the correct statement which applies to the "Fair Scheduler"
1. Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time
2. By default, the Fair Scheduler bases scheduling fairness decisions only on CPU
3. It can be configured to schedule with both memory and CPU
4. 1 and 3
5. 1, 2 and 3

Correct Answer : 4

Exp: Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of
scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU,
using the notion of Dominant Resource Fairness developed by Ghodsi et al. When there is a single app running, that app uses the entire cluster. When other apps are submitted,
resources that free up are assigned to the new apps, so that each app eventually gets roughly the same amount of resources. Unlike the default Hadoop scheduler, which forms a
queue of apps, this lets short apps finish in reasonable time while not starving long-lived apps. It is also a reasonable way to share a cluster between a number of users. Finally,
fair sharing can also work with app priorities - the priorities are used as weights to determine the fraction of total resources that each app should get.
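A minimal sketch of enabling this, assuming the standard YARN property names and a hypothetical allocation-file path:

import org.apache.hadoop.conf.Configuration;

public class FairSchedulerSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Switch the ResourceManager to the Fair Scheduler.
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    // The allocation file path is an assumption.
    conf.set("yarn.scheduler.fair.allocation.file", "/etc/hadoop/fair-scheduler.xml");
    // Inside fair-scheduler.xml, setting
    //   <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
    // switches fairness from memory-only to Dominant Resource Fairness (memory + CPU).
  }
}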






Related Questions


Question : What determines where blocks are written into HDFS by client applications?

1. The client queries the NameNode, which returns information on which DataNodes to use and the client writes to those DataNodes
2. The client writes immediately to DataNodes based on the cluster's rack locality settings

3. …

4. The client writes immediately to DataNodes at random
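The write path in option 1 is what the standard HDFS Java client does under the hood; the sketch below is illustrative and the file path is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() asks the NameNode which DataNodes should hold the blocks.
    FSDataOutputStream out = fs.create(new Path("/data/example.txt"));
    out.write("hello hdfs".getBytes("UTF-8"));
    // Bytes flow through a pipeline of the DataNodes the NameNode chose,
    // never through the NameNode itself.
    out.close();
  }
}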





Question : How does the NameNode know which DataNodes are currently available on a cluster?
1. DataNodes are listed in the dfs.hosts file. The NameNode uses that as the definitive list of available DataNodes.
2. DataNodes heartbeat in to the master on a regular basis.

3. …
4. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.
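DataNode heartbeats (option 2) are what let the NameNode maintain its live-node list, and that list can be queried from the client API. A minimal sketch, assuming a cluster configuration on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class LiveNodesSketch {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // The NameNode answers from the heartbeat records it keeps per DataNode.
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      System.out.println(dn.getHostName() + " last heartbeat at " + dn.getLastUpdate());
    }
  }
}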


Question : How does the HDFS architecture provide data reliability?
1. Storing multiple replicas of data blocks on different DataNodes.

2. Reliance on SAN devices as a DataNode interface.
3. …

4. DataNodes make copies of their data blocks, and put them on different local disks.
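Block replication across different DataNodes (option 1) is the mechanism HDFS actually uses, and the per-file replication factor can be inspected or changed from the client API. A sketch, with the file path as an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/data/example.txt");  // path is an assumption
    fs.setReplication(p, (short) 3);         // ask for 3 replicas per block
    short actual = fs.getFileStatus(p).getReplication();
    System.out.println("replication factor: " + actual);
  }
}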




Question : What is HBase?
1. HBase is a separate set of Java APIs for the Hadoop cluster
2. HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using Hadoop infrastructure
3. …
4. HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data processing.
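To make the "scanning large amounts of data" description concrete, here is a minimal scan using the HBase Java client; the table name is an assumption.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class HBaseScanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("mytable"));  // table name is an assumption
         ResultScanner scanner = table.getScanner(new Scan())) {
      for (Result row : scanner) {
        System.out.println(row);  // each Result is one row's cells
      }
    }
  }
}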



Question : What is the role of the namenode?
1. Namenode splits big files into smaller blocks and sends them to different datanodes
2. Namenode is responsible for assigning names to each slave node so that they can be identified by the clients
3. …
4. Both 2 and 3 are valid answers
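The namenode's core job is maintaining the filesystem namespace and the block-to-DataNode mapping, and that metadata is visible through the client API. A sketch, with the path as an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/data/example.txt"));  // path is an assumption
    // The block-to-DataNode mapping is metadata the namenode maintains.
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(java.util.Arrays.toString(b.getHosts()));
    }
  }
}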





Question : What happens if a datanode loses network connection for a few minutes?
1. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back
online, the administrator will need to manually delete the extra replicas
2. All data will be lost on that node. The administrator has to ensure proper data distribution between the nodes
3. …
4. The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra
replicas will be deleted

Ans : 4
Exp : The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node.
The moment a datanode becomes unavailable, the namenode triggers replication of the data from the existing replicas. If the datanode later comes back up, the over-replicated data
will be deleted. Note: the data might be deleted from the original datanode.
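The detection delay is governed by two configuration properties; the sketch below reads them and applies the commonly cited dead-node timeout formula (roughly 10.5 minutes with default values):

import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);                    // seconds
    long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000); // ms
    // A DataNode is declared dead after 2 * recheck-interval + 10 * heartbeat-interval,
    // and only then does re-replication of its blocks begin.
    long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
    System.out.println("dead-node timeout ~ " + timeoutMs / 60000.0 + " minutes"); // ~10.5 by default
  }
}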






Question : What happens if one of the datanodes has a much slower CPU? How will it affect the performance of the cluster?
1. The task execution will be as fast as the slowest worker.
However, if speculative execution is enabled, the slowest worker will not have such a big impact
2. The slowest worker will significantly impact job execution time. It will slow everything down
3. …
4. It depends on the level of priority assigned to the task. All high priority tasks are executed in parallel twice. A slower datanode would therefore be
bypassed. If the task is not high priority, however, performance will be affected.
Ans : 1
Exp : Hadoop was specifically designed to work with commodity hardware. Speculative execution helps to offset slow workers: multiple instances of the same task are created, the
job tracker takes the first result into consideration, and the remaining instances of the task are killed.
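Speculative execution is controlled per job; a minimal sketch using the standard MapReduce property names (the job name is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "speculative-demo"); // job name is an assumption
    job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
    job.getConfiguration().setBoolean("mapreduce.reduce.speculative", true);
    // With these on, a backup attempt of a straggling task is launched elsewhere;
    // the first attempt to finish wins and the other is killed.
  }
}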


Question : If you have a file of 128M size and the replication factor is set to 3, how many blocks can you find on the cluster that correspond to that file (assuming the default Apache Hadoop configuration)?
1. 3
2. 6
3. …
4. 12
Ans : 2
Exp : Based on the configuration settings, the file will be divided into multiple blocks according to the default block size of 64M: 128M / 64M = 2. Each block will then be
replicated according to the replication factor setting (default 3): 2 * 3 = 6.
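The same arithmetic as a tiny runnable check:

public class BlockCountSketch {
  public static void main(String[] args) {
    long fileSize = 128L * 1024 * 1024;   // 128M file
    long blockSize = 64L * 1024 * 1024;   // Apache Hadoop 1.x default block size
    int replication = 3;                  // default replication factor
    long blocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division -> 2
    System.out.println(blocks * replication);              // 2 * 3 = 6
  }
}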



Question : What is replication factor?

1. Replication factor controls how many times the namenode replicates its metadata
2. Replication factor creates multiple copies of the same file to be served to clients
3. Replication factor controls how many copies of each data block are stored in the cluster
4. None of these answers are correct.
Ans : 3
Exp : Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.



Question : How does the Hadoop cluster tolerate datanode failures?
1. Failures are anticipated. When they occur, the jobs are re-executed.
2. Datanodes talk to each other and figure out what needs to be re-replicated if one of the nodes goes down
3. …
4. Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively
maintains the replication factor on all data.
Ans : 4
Exp : The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and
starts replication of the data the moment a disconnect is detected.




Question : Which of the following tools defines a SQL-like language?
1. Pig
2. Hive
3. …
4. Flume
Ans : 2
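Hive's SQL-like language (HiveQL) is typically reached from Java through the HiveServer2 JDBC driver; in this sketch the connection URL and table name are assumptions, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 URL and table name are assumptions.
    try (Connection c = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement s = c.createStatement();
         ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM mytable")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));  // HiveQL compiles down to cluster jobs
      }
    }
  }
}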



Question : As a client of HadoopExam, you are able to access the Hadoop cluster of HadoopExam Inc. Once your application validates its identity and is granted access to a file in the cluster, what is the remainder of the read path back to the client?
1. The NameNode gives the client the block IDs and a list of DataNodes on which those blocks are found, and the application reads the blocks directly from the DataNodes.
2. The NameNode maps the read request against the block locations in its stored metadata, and reads those blocks from the DataNodes. The client application then reads
the blocks from the NameNode.
3. …
4. The NameNode refers the client to the DataNode closest to the client according to Hadoop's rack topology. The client application then reads the blocks from that single DataNode.
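The read path in option 1 is what the HDFS Java client does: open() obtains block IDs and DataNode locations from the NameNode, and the bytes then stream directly from the DataNodes. A sketch, with the file path as an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() fetches block locations from the NameNode; the path is an assumption.
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);  // bytes stream directly from the DataNodes, not the NameNode
      System.out.println("read " + n + " bytes");
    }
  }
}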