Question : What determines where blocks are written into HDFS by client applications?
1. The client queries the NameNode, which returns information on which DataNodes to use, and the client writes to those DataNodes. 2. The client writes immediately to DataNodes based on the cluster's rack locality settings.
Explanation: The NameNode will return a list of DataNodes to which the client should write. The contents of the file are never sent to the NameNode.
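For reference, below is a minimal sketch (Java, Hadoop FileSystem API) of a client write; the NameNode URI and file path are hypothetical examples. The client contacts the NameNode only for block placement, then streams the data directly to the chosen DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; the client asks it which DataNodes to write to.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            // These bytes are streamed to the DataNodes the NameNode selected;
            // the file contents never pass through the NameNode itself.
            out.writeUTF("hello hdfs");
        }
    }
}
```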
Question : How does the NameNode know which DataNodes are currently available on a cluster? 1. DataNodes are listed in the dfs.hosts file. The NameNode uses that as the definitive list of available DataNodes. 2. DataNodes heartbeat in to the master on a regular basis.
Explanation: DataNodes heartbeat in to the master every three seconds. When a DataNode heartbeats in to the NameNode for the first time, the NameNode marks it as available. DataNodes can be listed in a file pointed to by the dfs.hosts property, but this only lists the names of possible DataNodes. It is not a definitive list of those which are available but rather a list of the only machines which may be used as DataNodes if they begin to heartbeat.
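As a rough illustration, the sketch below (Java, using Hadoop's Configuration) shows the two settings involved; the whitelist file path is a hypothetical example. dfs.hosts only whitelists machines that may serve as DataNodes, while availability is determined by heartbeats.

```java
import org.apache.hadoop.conf.Configuration;

public class DataNodeMembershipConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Whitelist file: one hostname per line. Hosts not listed here are refused,
        // but being listed does not mean the DataNode is currently available.
        conf.set("dfs.hosts", "/etc/hadoop/conf/dfs.hosts.allow");

        // Heartbeat interval in seconds (3 by default). A DataNode is marked
        // available only once it begins heartbeating in to the NameNode.
        conf.setLong("dfs.heartbeat.interval", 3L);

        System.out.println("dfs.hosts = " + conf.get("dfs.hosts"));
        System.out.println("dfs.heartbeat.interval = " + conf.getLong("dfs.heartbeat.interval", 3L));
    }
}
```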
Question : How does the HDFS architecture provide data reliability? 1. Storing multiple replicas of data blocks on different DataNodes.
Explanation: HDFS provides reliability by splitting a file into multiple blocks and replicating each block on multiple different machines (3 by default). Although it is possible to use RAID on DataNodes, this is not a recommended configuration, as it reduces the amount of raw disk which can be used for data storage and is not necessary.
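A minimal sketch (Java) of the replication settings this relies on; the file path is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor for newly written files.
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Replication can also be changed per file after it is written;
            // HDFS adds or removes block replicas to match the new factor.
            fs.setReplication(new Path("/tmp/example.txt"), (short) 3);
        }
    }
}
```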
Question : The ApplicationReport received from the ResourceManager consists of which of the following? 1. General application information: ApplicationId, the queue to which the application was submitted, the user who submitted the application and the start time for the application. 2. ApplicationMaster details: the host on which the ApplicationMaster is running, the RPC port (if any) on which it is listening for requests from clients and a token that the client needs to communicate with the ApplicationMaster. 3. Application tracking information: if the application supports some form of progress tracking, it can set a tracking URL which is available via ApplicationReport#getTrackingUrl and which a client can look at to monitor progress. 4. ApplicationStatus: the state of the application as seen by the ResourceManager is available via ApplicationReport#getYarnApplicationState. If the YarnApplicationState is FINISHED, the client should refer to ApplicationReport#getFinalApplicationStatus to check for the actual success/failure of the application task itself. In case of failures, ApplicationReport#getDiagnostics may be useful to shed some more light on the failure. 5. All of the above
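As an illustrative sketch (Java, YARN client API), all of these fields can be read from the ApplicationReport obtained through YarnClient. It assumes a recent Hadoop release where ApplicationId.fromString is available, and the application id string is a made-up example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ApplicationReportExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        ApplicationId appId = ApplicationId.fromString("application_1234567890123_0001");
        ApplicationReport report = yarnClient.getApplicationReport(appId);

        // 1. General application information
        System.out.println(report.getQueue() + " " + report.getUser() + " " + report.getStartTime());
        // 2. ApplicationMaster details
        System.out.println(report.getHost() + ":" + report.getRpcPort());
        // 3. Application tracking information
        System.out.println(report.getTrackingUrl());
        // 4. ApplicationStatus
        if (report.getYarnApplicationState() == YarnApplicationState.FINISHED) {
            System.out.println(report.getFinalApplicationStatus());
            System.out.println(report.getDiagnostics());
        }

        yarnClient.stop();
    }
}
```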
Question : Select the correct statement(s) about the ApplicationMaster. 1. The ApplicationMaster is the actual owner of the job. It will be launched by the ResourceManager and will be provided, via the client, with all the necessary information and resources about the job that it has been tasked to oversee and complete. 2. As the ApplicationMaster is launched within a container that may (and likely will) be sharing a physical host with other containers, given the multi-tenancy nature, amongst other issues, it cannot make any assumptions about things like pre-configured ports that it can listen on. 3. When the ApplicationMaster starts up, several parameters are made available to it via the environment. These include the ContainerId for the ApplicationMaster container, the application submission time and details about the NodeManager host running the ApplicationMaster. Refer to ApplicationConstants for parameter names. 4. 1 and 2 5. 1, 2 and 3
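A minimal sketch (Java) of an ApplicationMaster reading these parameters from its environment at startup, assuming a recent Hadoop release where ContainerId.fromString is available:

```java
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ApplicationMasterEnv {
    public static void main(String[] args) {
        // ContainerId of the container the ApplicationMaster itself runs in.
        String containerIdStr =
                System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name());
        ContainerId containerId = ContainerId.fromString(containerIdStr);
        ApplicationAttemptId attemptId = containerId.getApplicationAttemptId();

        // Application submission time.
        String submitTime = System.getenv(ApplicationConstants.APP_SUBMIT_TIME_ENV);

        // Details of the NodeManager host running the ApplicationMaster.
        String nmHost = System.getenv(ApplicationConstants.Environment.NM_HOST.name());
        String nmPort = System.getenv(ApplicationConstants.Environment.NM_PORT.name());

        System.out.println(attemptId + " submitted at " + submitTime
                + ", running on " + nmHost + ":" + nmPort);
    }
}
```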
Question : Select the option(s) which is/are correct. 1. YARN takes into account all of the available compute resources on each machine in the cluster. 2. Based on the available resources, YARN negotiates resource requests from applications (such as MapReduce) running in the cluster. 3. YARN then provides processing capacity to each application by allocating Containers. 4. 1 and 3 5. 1, 2 and 3
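For context, the sketch below (Java, Hadoop Configuration) shows the kind of settings through which YARN learns each node's compute resources and bounds the Containers it allocates; the values are illustrative only, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnResourceConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Compute resources this NodeManager advertises to the ResourceManager.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);

        // Bounds the scheduler uses when negotiating Container allocations.
        conf.setInt("yarn.scheduler.minimum-allocation-mb", 1024);
        conf.setInt("yarn.scheduler.maximum-allocation-mb", 8192);

        System.out.println("NodeManager memory (MB): "
                + conf.getInt("yarn.nodemanager.resource.memory-mb", -1));
    }
}
```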