
IBM Certified Data Architect - Big Data Certification Questions and Answers (Dumps and Practice Questions)



Question : In the Hadoop framework, if HBase is also running on the same node, for which the available RAM is 4 GB, what is the ideal configuration
for "Reserved System Memory"?

1. 1GB
2. 2GB
3. 3GB
4. No need to reserve

Correct Answer : 1

The total available RAM for YARN and MapReduce should take into account the Reserved Memory. Reserved Memory is the RAM needed by system processes and other Hadoop processes (such
as HBase).

Reserved Memory = Reserved for stack memory + Reserved for HBase Memory (If HBase is on the same node)

Use the following table to determine the Reserved Memory per node.

Reserved Memory Recommendations
Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory
4 GB                  | 1 GB                               | 1 GB
8 GB                  | 2 GB                               | 1 GB
16 GB                 | 2 GB                               | 2 GB
24 GB                 | 4 GB                               | 4 GB
48 GB                 | 6 GB                               | 8 GB
64 GB                 | 8 GB                               | 8 GB
72 GB                 | 8 GB                               | 8 GB
96 GB                 | 12 GB                              | 16 GB
128 GB                | 24 GB                              | 24 GB
256 GB                | 32 GB                              | 32 GB
512 GB                | 64 GB                              | 64 GB
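The formula and table above can be sketched as a small lookup helper. The table values are exactly those listed in this document; the function names and the choice to round down to the nearest table row for in-between sizes are assumptions for illustration.

```python
# Recommended reserved memory per node, keyed by total RAM in GB,
# taken from the Reserved Memory Recommendations table above.
RESERVED = {
    # total GB: (reserved system GB, reserved HBase GB)
    4: (1, 1), 8: (2, 1), 16: (2, 2), 24: (4, 4),
    48: (6, 8), 64: (8, 8), 72: (8, 8), 96: (12, 16),
    128: (24, 24), 256: (32, 32), 512: (64, 64),
}

def reserved_memory(total_gb, hbase_on_node=True):
    """Return (system GB, HBase GB) for the nearest table row
    at or below total_gb; HBase reservation applies only if
    HBase runs on the same node."""
    key = max(k for k in RESERVED if k <= total_gb)
    system, hbase = RESERVED[key]
    return (system, hbase if hbase_on_node else 0)

def yarn_available(total_gb, hbase_on_node=True):
    """RAM left for YARN and MapReduce after subtracting
    Reserved Memory, per the formula above."""
    system, hbase = reserved_memory(total_gb, hbase_on_node)
    return total_gb - system - hbase
```

For the question above, `reserved_memory(4)` returns `(1, 1)`: 1 GB reserved system memory (answer 1) plus 1 GB for HBase.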









Question : MapReduce runs on top of YARN and utilizes YARN Containers to schedule and execute its Map and Reduce tasks.
When configuring MapReduce resource utilization on YARN, which of the following aspects should you consider?


1. The physical RAM limit for each Map and Reduce task
2. The JVM heap size limit for each task.
3. The amount of virtual memory each task will receive.
4. 1 and 3
5. All of 1, 2 and 3


Correct Answer : 5
MapReduce runs on top of YARN and utilizes YARN Containers to schedule and execute its Map and Reduce tasks. When configuring MapReduce resource utilization on YARN, there are
three aspects to consider:

The physical RAM limit for each Map and Reduce task.

The JVM heap size limit for each task.

The amount of virtual memory each task will receive.

You can define a maximum amount of memory for each Map and Reduce task. Since each Map and Reduce task will run in a separate Container, these maximum memory settings should be
equal to or greater than the YARN minimum Container allocation.
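As an illustration, the three aspects map onto standard Hadoop configuration properties. The property names are the real ones; the sizes below are placeholder examples only, not recommendations, and must be derived from your node's available RAM:

```xml
<!-- mapred-site.xml: physical RAM limit per Map and Reduce task.
     These must be >= the YARN minimum Container allocation. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>

<!-- JVM heap size limit per task, typically below the Container size
     to leave headroom for non-heap JVM memory. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6552m</value>
</property>

<!-- yarn-site.xml: virtual memory per task, expressed as a ratio of
     its physical memory limit (2.1 is the Hadoop default). -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```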







Question : Assuming you're not running HDFS Federation, what is the maximum number of NameNode daemons you
should run on your cluster in order to avoid a split-brain scenario with your NameNode when running HDFS
High Availability (HA) using Quorum-based storage?


1. Two active NameNodes and two Standby NameNodes
2. One active NameNode and one Standby NameNode
3. Two active NameNodes and one Standby NameNode
4. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy


Correct Answer : 2

Explanation: In a typical HA cluster, two separate machines are configured as NameNodes. In a working cluster, one of the NameNode machines is in the Active state, and the other is in the
Standby state.

The Active NameNode is responsible for all client operations in the cluster, while the Standby acts as a slave. The Standby machine maintains enough state to provide a fast
failover (if required).

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called JournalNodes (JNs). When the
Active node performs any namespace modification, it durably logs a modification record to a majority of these JNs. The Standby node reads the edits from the JNs and
continuously watches them for changes to the edit log. Once the Standby node observes the edits, it applies them to its own namespace. When using QJM, the JournalNodes act as
the shared edit log storage. In a failover event, the Standby ensures that it has read all of the edits from the JournalNodes before promoting itself to the Active state. (This
mechanism ensures that the namespace state is fully synchronized before a failover completes.)

Note
Secondary NameNode is not required in HA configuration because the Standby node also performs the tasks of the Secondary NameNode.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information of the location of blocks in your cluster. To get accurate information
of the block locations, DataNodes are configured with the location of both the NameNodes and send block location information and heartbeats to both NameNode machines.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly
diverge between the two NameNode machines, causing potential data loss. (This situation is called a split-brain scenario.)
To prevent the split-brain scenario, the JournalNodes allow only one NameNode to be a writer at a time. During failover, the NameNode that is chosen to become Active takes
over the role of writing to the JournalNodes. This prevents the other NameNode from continuing in the Active state and thus lets the new Active node proceed with the
failover safely.
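A minimal hdfs-site.xml sketch of such a two-NameNode HA setup with Quorum-based storage might look as follows. The property names are the standard HDFS HA ones; the nameservice ID, NameNode IDs, and hostnames are placeholders:

```xml
<!-- hdfs-site.xml: one logical nameservice backed by exactly two
     NameNodes (one Active, one Standby). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>

<!-- The JournalNode quorum that acts as the shared edit log storage;
     the Active NameNode must log each edit to a majority of these. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```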





Related Questions


Question : Suppose cluster resources become scarce and the scheduler ..
1. Stop working
2. All jobs will hang for some time
3. All jobs will be stopped and need to be re-started
4. The ResourceManager symmetrically requests back resources from a running application


Question :

The ____________ supports a number of features such as weights on queues (heavier queues get more containers), minimum shares, maximum shares, and FIFO policy within queues, but
the basic idea is to share the resources as uniformly as possible.



1. Fair Scheduler
2. Capacity Scheduler
3. FIFO Scheduler
4. Both 1 and 2
5. Both 2 and 3
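The queue features named in the question above (weights, minimum and maximum shares, FIFO policy within a queue) are declared in the Fair Scheduler's allocation file. A hedged sketch, with made-up queue names and resource sizes:

```xml
<!-- fair-scheduler.xml: Fair Scheduler allocation file. -->
<allocations>
  <queue name="prod">
    <!-- Heavier weight: this queue gets proportionally more containers. -->
    <weight>2.0</weight>
    <!-- Minimum share: guaranteed when the queue has waiting applications. -->
    <minResources>10240 mb,10 vcores</minResources>
    <!-- Maximum share: hard cap on this queue's usage. -->
    <maxResources>40960 mb,40 vcores</maxResources>
    <!-- FIFO policy within the queue; sharing across queues stays fair. -->
    <schedulingPolicy>fifo</schedulingPolicy>
  </queue>
  <queue name="dev">
    <weight>1.0</weight>
    <minResources>5120 mb,5 vcores</minResources>
  </queue>
</allocations>
```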


Question : In your cluster the ResourceManager is configured with the "Fair Scheduler", and on average, every hour, Hadoop runs jobs in parallel.
Currently a single job is running; how much of the cluster's resource capacity can this single running job use?


1. 1/100 resource of the cluster
2. 20% of the cluster capacity
3. May use full capacity of the cluster
4. It cannot be determined




Question : Which of the following schedulers ensures that certain users, groups, or production applications always get sufficient resources?
When a queue contains waiting applications, it gets at least its minimum share.
1. Fair Scheduler
2. Capacity Scheduler
3. FIFO Scheduler
4. Both 1 and 2
5. Both 2 and 3



Question :

Select the correct statement for the Fair Scheduler



1. allows containers to request variable amounts of memory and schedules based on those requirements
2. If an application is given a container that it cannot use immediately due to a shortage of memory, it can reserve that container, and no other application can use it until the container is released.
3. works best when the workloads are well known, which helps in assigning the minimum capacity.
4. 1 and 2
5. 1 and 3


Question : The Fair scheduler works best when there is a

1. need for higher memory
2. lot of variability between queues
3. set of well-known workloads
4. need for higher CPU
5. need to process all jobs in submission order