Question : Suppose cluster resources become scarce and the scheduler decides to reclaim some of the resources that were given to a running application. What happens? 1. The scheduler stops working 2. All jobs hang for some time 3. All jobs are stopped and need to be restarted 4. The ResourceManager symmetrically requests back resources from the running application
Correct Answer : 4 As previously described, the ResourceManager is the master that arbitrates all the available cluster resources, thereby helping manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers and the per-application ApplicationMasters. In YARN, the ResourceManager is primarily limited to scheduling; that is, it allocates available resources in the system among the competing applications but does not concern itself with per-application state management. The scheduler handles only an overall resource profile for each application, ignoring local optimizations and internal application flow. In fact, YARN completely departs from the static assignment of map and reduce slots because it treats the cluster as a resource pool. Because of this clear separation of responsibilities, coupled with the modularity described previously, the ResourceManager is able to address the important design requirements of scalability and support for alternative programming paradigms. In contrast to many other workflow schedulers, the ResourceManager also has the ability to symmetrically request back resources from a running application. This situation typically happens when cluster resources become scarce and the scheduler decides to reclaim some (but not all) of the resources that were given to an application.
In YARN, ResourceRequests can be strict or negotiable. This feature provides ApplicationMasters with a great deal of flexibility in how they fulfill reclamation requests: for example, by picking containers to reclaim that are less crucial to the computation, by checkpointing the state of a task, or by migrating the computation to other running containers. Overall, this scheme allows applications to preserve work, in contrast to platforms that kill containers to satisfy resource constraints. If the application is noncollaborative, the ResourceManager can, after waiting a certain amount of time, obtain the needed resources by instructing the NodeManagers to forcibly terminate containers.
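To make this concrete, below is a minimal Java sketch of how an ApplicationMaster might inspect the preemption message attached to an allocate response. The YARN types used (AMRMClient, AllocateResponse, PreemptionMessage) are real API classes, but the helper methods checkpointAndRelease and releaseLeastCriticalContainers are hypothetical placeholders for application-specific logic.

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionContract;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class PreemptionAwareHeartbeat {
  // Called from the AM's heartbeat loop; rmClient is an initialized AMRMClient.
  static void heartbeat(AMRMClient<AMRMClient.ContainerRequest> rmClient,
                        float progress) throws Exception {
    AllocateResponse response = rmClient.allocate(progress);
    PreemptionMessage pm = response.getPreemptionMessage();
    if (pm == null) {
      return; // the scheduler is not reclaiming anything this round
    }
    // Strict contract: these containers will be forcibly terminated soon,
    // so checkpoint their state while there is still time.
    if (pm.getStrictContract() != null) {
      for (PreemptionContainer c : pm.getStrictContract().getContainers()) {
        checkpointAndRelease(c.getId());
      }
    }
    // Negotiable contract: the AM may choose which resources to give back,
    // e.g., containers that are least crucial to the computation.
    if (pm.getContract() != null) {
      releaseLeastCriticalContainers(pm.getContract());
    }
  }

  static void checkpointAndRelease(ContainerId id) {
    // Hypothetical: persist task state, then stop the container.
  }

  static void releaseLeastCriticalContainers(PreemptionContract contract) {
    // Hypothetical: pick victims, migrate their work, then release them.
  }
}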
ResourceManager failures remain significant events affecting cluster availability. As of this writing, the ResourceManager will restart running ApplicationMasters as it recovers its state. If the framework supports restart capabilities (and most will, for routine fault tolerance), the platform will automatically restore users' pipelines. In contrast to the Hadoop 1.0 JobTracker, it is important to mention the tasks for which the ResourceManager is not responsible. It does not track application execution flow or handle task fault tolerance, nor does it provide access to the application status (a servlet that is now part of the ApplicationMaster) or track previously executed jobs, a responsibility now delegated to the JobHistoryService (a daemon running on a separate node). This is consistent with the view that the ResourceManager should handle only live resource scheduling, and it helps YARN's central components scale better than the Hadoop 1.0 JobTracker.
Question :
The ____________ supports a number of features such as weights on queues (heavier queues get more containers), minimum shares, maximum shares, and FIFO policy within queues, but the basic idea is to share the resources as uniformly as possible.
1. Fair Scheduler 2. Capacity Scheduler 3. FIFO Scheduler 4. Both 1 and 2 5. Both 2 and 3
Correct Answer : 1
The Fair scheduler is a third pluggable scheduler for Hadoop that provides another way to share large clusters. Fair scheduling is a method of assigning resources to applications such that all applications get, on average, an equal share of resources over time.
In Hadoop version 1, the Fair scheduler uses the term "pool" to refer to a queue. Starting with the YARN Fair scheduler, the term "queue" is used instead of "pool." To provide backward compatibility with the original Fair scheduler, "queue" elements can also be named "pool" elements.
In the Fair scheduler model, every application belongs to a queue. YARN containers are given to the queue with the least amount of allocated resources. Within the queue, the application that has the fewest resources is assigned the container. By default, all users share a single queue, called "default." If an application specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to configure the Fair scheduler to assign queues based on the user name included with the request. The Fair scheduler supports a number of features such as weights on queues (heavier queues get more containers), minimum shares, maximum shares, and FIFO policy within queues, but the basic idea is to share the resources as uniformly as possible.
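As a concrete illustration of the queue model, the following minimal Java sketch submits an application to a named queue through the YARN client API. The queue name "analytics" and the application name are illustrative assumptions; if no queue is set, the application lands in the "default" queue (or a queue derived from the user name, if the Fair scheduler is configured that way).

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class QueueSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("queue-demo");  // illustrative name
    ctx.setQueue("analytics");             // hypothetical Fair scheduler queue
    // A real submission also needs launch commands and environment settings.
    ctx.setAMContainerSpec(Records.newRecord(ContainerLaunchContext.class));
    ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

    yarnClient.submitApplication(ctx);
    yarnClient.stop();
  }
}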
Question : In your cluster, the ResourceManager is configured with the Fair Scheduler, and on average Hadoop runs several jobs in parallel every hour. Currently only a single job is running. How much of the cluster's resource capacity can this single running job use?
1. 1/100 of the cluster's resources 2. 20% of the cluster capacity 3. It may use the full capacity of the cluster 4. It cannot be determined
Correct Answer : 3 Under the Fair scheduler, when a single application is running, that application may request the entire cluster (if needed). If additional applications are submitted, resources that are free are assigned "fairly" to the new applications so that each application gets roughly the same amount of resources. The Fair scheduler also applies the notion of preemption, whereby containers can be requested back from the ApplicationMaster. Depending on the configuration and application design, preemption and subsequent assignment can be either friendly or forceful.
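For reference, a hedged configuration sketch follows, showing how one might programmatically select the Fair scheduler and enable its preemption support. The property yarn.scheduler.fair.preemption matches the Fair scheduler documentation; in practice these settings usually live in yarn-site.xml, and per-queue preemption timeouts are tuned in the scheduler's allocation file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FairSchedulerConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();

    // Plug in the Fair scheduler as the ResourceManager's scheduler.
    conf.set(YarnConfiguration.RM_SCHEDULER,
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

    // Allow the scheduler to preempt containers from applications that are
    // over their fair share when other queues are starved.
    conf.setBoolean("yarn.scheduler.fair.preemption", true);

    System.out.println("Scheduler class: " + conf.get(YarnConfiguration.RM_SCHEDULER));
  }
}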
1. By loading both social and expense data into the current Enterprise Data Warehouse, then running analytics
2. By loading the social data into BigInsights for exploration, then moving the resulting data to the Enterprise Data Warehouse and merging it with the expense data for analytics