Question: As a Hadoop developer you have always preferred MapReduce job chaining to execute multiple MapReduce jobs, where the output of one job becomes the input of the next. Recently you learned that Apache Oozie is the preferred workflow engine for Hadoop jobs. Select the correct statement about Apache Oozie. 1. A chain of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins. 2. Iterative repetition of MapReduce jobs, shell scripts and the Quartz scheduler until a desired answer or state is reached.
Explanation: There are two basic types of Oozie jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute; a Workflow job has to wait for each action to complete before moving on. Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. In addition, an Oozie Bundle provides a way to package multiple Coordinator and Workflow jobs and to manage the lifecycle of those jobs.

What Oozie Does: Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple component tasks. This allows for greater control over complex jobs and also makes it easier to repeat those jobs at predetermined intervals, helping administrators derive more value from their Hadoop investment.

How Oozie Works: An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes define the job chronology, setting the rules for beginning and ending a workflow, and control the workflow execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks. Oozie triggers workflow actions, but Hadoop MapReduce executes them; this allows Oozie to leverage other capabilities within the Hadoop stack to balance loads and handle failures. Oozie detects completion of tasks through callback and polling: when Oozie starts a task, it provides the task with a unique callback HTTP URL, and the task notifies that URL when it is complete. If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Often it is necessary to run Oozie workflows at regular time intervals, but in coordination with unpredictable levels of data availability or events. In these circumstances, Oozie Coordinator allows you to model workflow execution triggers in the form of data, time or event predicates; the workflow job is started only after those predicates are satisfied. Oozie Coordinator can also manage multiple workflows that depend on the outcome of preceding workflows, where the outputs of one workflow become the input to the next. This chain is called a "data application pipeline".

Oozie is a workflow system built specifically to work with Hadoop, MapReduce and Pig jobs. An Oozie workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph); a "control dependency" from one action to another means that the second action cannot run until the first action has completed. Oozie workflow definitions are written in hPDL (an XML Process Definition Language similar to JBoss jBPM's jPDL). Users write workflows in this XML language to define one or more MapReduce jobs, their interdependencies, and what to do in the case of failures. These workflows are uploaded to the Oozie server, where they are scheduled to run or executed immediately. When Oozie executes a MapReduce job as part of a workflow, the job is run by the Oozie server, which keeps track of job-level failures and status.
Oozie workflow actions start jobs in remote systems (e.g. Hadoop, Pig). Upon completion of an action, the remote system calls back Oozie to notify it of the action's completion, at which point Oozie proceeds to the next action in the workflow.
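To make the submission path concrete, here is a minimal sketch of driving Oozie from Java with the org.apache.oozie.client API, along the lines of the client example in the Oozie documentation. The server URL, HDFS application path and the property values are placeholder assumptions for this sketch; the workflow.xml referenced at the application path would be the hPDL definition described above.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (hypothetical host, default Oozie port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Build the job configuration; APP_PATH points at the HDFS directory
        // holding the hPDL workflow.xml (path is an assumption for this sketch).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/my-wf-app");
        conf.setProperty("queueName", "default");

        // Submit and start the workflow; Oozie returns a job id that it tracks.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job: " + jobId);

        // The Oozie server, not this client, runs the actions; the client only
        // polls the workflow's overall status.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```

Note that the callback/polling described in the explanation happens between Oozie and the launched Hadoop jobs; the loop above is just the client watching the workflow's state on the server.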
Question: Your cluster has datanodes, each with a single 1 TB hard drive allocated to HDFS storage. You reserve no disk space for MapReduce. You use the default replication settings. How much data can you store in HDFS (assuming no compression)? 1. about 5 TB 2. about 20 TB
Explanation: By default, Hadoop replicates each HDFS block three times. So if your cluster has a total of 60 TB of raw disk space allocated to HDFS, you will be able to store 60 / 3 = 20 TB of data on the cluster.
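As a quick sanity check of that arithmetic, here is a tiny sketch in plain Java (the 60 TB raw figure is taken from the explanation above) showing how usable capacity follows from raw capacity and the default replication factor.

```java
public class HdfsCapacityEstimate {
    public static void main(String[] args) {
        double rawCapacityTb = 60.0;   // total raw disk allocated to HDFS (assumed, per the explanation)
        int replicationFactor = 3;     // dfs.replication default in Hadoop

        // Every block is stored replicationFactor times, so usable capacity
        // is raw capacity divided by the replication factor.
        double usableTb = rawCapacityTb / replicationFactor;
        System.out.printf("Usable HDFS capacity: %.1f TB%n", usableTb);  // prints 20.0 TB
    }
}
```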
Question: How do you differentiate between a failed task and a killed task?
In the above scenario: 1. YARN will be able to allocate on each node up to 10 mappers or 3 reducers, or some permutation within that. 2. YARN will be able to allocate on each node up to 8 mappers or 5 reducers, or some permutation within that. 3. … 4. With YARN and MapReduce 2, you will have pre-configured static slots for Map and Reduce tasks.
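The scenario itself is not shown here, but the reasoning behind options 1 and 2 is that YARN has no fixed map/reduce slots: the number of concurrent map or reduce containers a node can run falls out of its memory settings. A minimal sketch with assumed values (yarn.nodemanager.resource.memory-mb = 40960, mapreduce.map.memory.mb = 4096, mapreduce.reduce.memory.mb = 13312, all hypothetical) that would reproduce a "10 mappers or 3 reducers" split:

```java
public class YarnContainerEstimate {
    public static void main(String[] args) {
        // All values below are assumptions for illustration, not from the question.
        int nodeMemoryMb      = 40960; // yarn.nodemanager.resource.memory-mb
        int mapContainerMb    = 4096;  // mapreduce.map.memory.mb
        int reduceContainerMb = 13312; // mapreduce.reduce.memory.mb

        // YARN allocates whole containers, so a node can run at most
        // floor(nodeMemory / containerSize) tasks of a given type at once.
        System.out.println("Max concurrent mappers per node:  " + nodeMemoryMb / mapContainerMb);    // 10
        System.out.println("Max concurrent reducers per node: " + nodeMemoryMb / reduceContainerMb); // 3
    }
}
```

Any mix of map and reduce containers that fits within the node's memory is possible, which is the contrast with the fixed map/reduce slots of MapReduce 1.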