Question: As a Hadoop developer you have always preferred MapReduce job chaining to execute multiple MapReduce jobs, where the output of one job becomes the input of the next. Recently you learned that Apache Oozie is the preferred workflow engine for Hadoop jobs. Select the correct statement about Apache Oozie. 1. A chain of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins. 2. Iterative repetition of MapReduce jobs, shell scripts and the Quartz scheduler until a desired answer or state is reached.
Explanation: There are two basic types of Oozie jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute; a Workflow job has to wait for each action to complete before moving on. Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. In addition, an Oozie Bundle provides a way to package multiple Coordinator and Workflow jobs and to manage the lifecycle of those jobs.

What Oozie Does: Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple component tasks. This allows for greater control over complex jobs and also makes it easier to repeat those jobs at predetermined intervals, helping administrators derive more value from their Hadoop investment.

How Oozie Works: An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes define the job chronology, setting the rules for beginning and ending a workflow, and control the workflow execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks. Oozie triggers workflow actions, but Hadoop MapReduce executes them; this allows Oozie to leverage other capabilities within the Hadoop stack to balance loads and handle failures. Oozie detects completion of tasks through callback and polling: when Oozie starts a task, it provides the task with a unique callback HTTP URL, and the task notifies that URL when it is complete. If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Often it is necessary to run Oozie workflows at regular time intervals, but in coordination with unpredictable levels of data availability or events. In these circumstances, Oozie Coordinator allows you to model workflow execution triggers in the form of data, time or event predicates; the workflow job is started only after those predicates are satisfied. Oozie Coordinator can also manage multiple workflows that depend on the outcome of preceding workflows, where the outputs of one workflow become the input to the next. This chain is called a "data application pipeline".

Oozie is a workflow system built specifically to work with Hadoop, MapReduce and Pig jobs. An Oozie workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph); a "control dependency" from one action to another means that the second action cannot run until the first action has completed. Oozie workflow definitions are written in hPDL (an XML Process Definition Language similar to JBoss jBPM's jPDL). Users write workflows in this XML language to define one or more MapReduce jobs, their interdependencies, and what to do in the case of failures. These workflows are uploaded to the Oozie server, where they are scheduled to run or executed immediately. When Oozie executes a MapReduce job as part of a workflow, the job is run by the Oozie server, which keeps track of job-level failures and status.
Oozie workflow actions start jobs in remote systems (e.g. Hadoop, Pig). Upon completion of an action, the remote system calls back Oozie to notify it of the action's completion, at which point Oozie proceeds to the next action in the workflow.
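To make the submission path concrete, here is a minimal sketch of driving Oozie from Java with the org.apache.oozie.client API, along the lines of the client example in the Oozie documentation. The server URL, HDFS application path and the property values are placeholder assumptions for this sketch; the workflow.xml referenced at the application path would be the hPDL definition described above.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (hypothetical host, default Oozie port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Build the job configuration; APP_PATH points at the HDFS directory
        // holding the hPDL workflow.xml (path is an assumption for this sketch).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/my-wf-app");
        conf.setProperty("queueName", "default");

        // Submit and start the workflow; Oozie returns a job id that it tracks.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job: " + jobId);

        // The Oozie server, not this client, runs the actions; the client only
        // polls the workflow's overall status.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```

Note that the callback/polling described in the explanation happens between Oozie and the launched Hadoop jobs; the loop above is just the client watching the workflow's state on the server.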
Question: Your cluster has datanodes, each with a single 1 TB hard drive allocated to HDFS storage. You reserve no disk space for MapReduce. You use the default replication settings. How much data can you store in HDFS (assuming no compression)? 1. about 5 TB 2. about 20 TB
Explanation: By default, Hadoop replicates each HDFS block three times. So if your cluster has a total of 60 TB of raw disk space allocated to HDFS, you will be able to store 60 / 3 = 20 TB of data on the cluster.
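As a quick sanity check of that arithmetic, here is a tiny sketch in plain Java (the 60 TB raw figure is taken from the explanation above) showing how usable capacity follows from raw capacity and the default replication factor.

```java
public class HdfsCapacityEstimate {
    public static void main(String[] args) {
        double rawCapacityTb = 60.0;   // total raw disk allocated to HDFS (assumed, per the explanation)
        int replicationFactor = 3;     // dfs.replication default in Hadoop

        // Every block is stored replicationFactor times, so usable capacity
        // is raw capacity divided by the replication factor.
        double usableTb = rawCapacityTb / replicationFactor;
        System.out.printf("Usable HDFS capacity: %.1f TB%n", usableTb);  // prints 20.0 TB
    }
}
```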
Question: How do you differentiate between a failed task and a killed task?
In the above scenario: 1. YARN will be able to allocate on each node up to 10 mappers or 3 reducers, or some permutation within that. 2. YARN will be able to allocate on each node up to 8 mappers or 5 reducers, or some permutation within that. 3. … 4. With YARN and MapReduce 2, you will have pre-configured static slots for Map and Reduce tasks.
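The scenario itself is not shown here, but the reasoning behind options 1 and 2 is that YARN has no fixed map/reduce slots: the number of concurrent map or reduce containers a node can run falls out of its memory settings. A minimal sketch with assumed values (yarn.nodemanager.resource.memory-mb = 40960, mapreduce.map.memory.mb = 4096, mapreduce.reduce.memory.mb = 13312, all hypothetical) that would reproduce a "10 mappers or 3 reducers" split:

```java
public class YarnContainerEstimate {
    public static void main(String[] args) {
        // All values below are assumptions for illustration, not from the question.
        int nodeMemoryMb      = 40960; // yarn.nodemanager.resource.memory-mb
        int mapContainerMb    = 4096;  // mapreduce.map.memory.mb
        int reduceContainerMb = 13312; // mapreduce.reduce.memory.mb

        // YARN allocates whole containers, so a node can run at most
        // floor(nodeMemory / containerSize) tasks of a given type at once.
        System.out.println("Max concurrent mappers per node:  " + nodeMemoryMb / mapContainerMb);    // 10
        System.out.println("Max concurrent reducers per node: " + nodeMemoryMb / reduceContainerMb); // 3
    }
}
```

Any mix of map and reduce containers that fits within the node's memory is possible, which is the contrast with the fixed map/reduce slots of MapReduce 1.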