Question: The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously, and the results of whichever task finishes first are used. This is called:
1. Combine
2. IdentityMapper
4. Default Partitioner
5. Speculative Execution
Correct Answer: 5 (Speculative Execution)
Explanation: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may read its input at only 10% of the speed of the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, and that task takes much longer than all the others.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from; they trust the Hadoop platform to deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform schedules redundant copies of the remaining tasks across several nodes that do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The reducers then receive their inputs from whichever mapper completed successfully first.
Note:
* Hadoop uses "speculative execution": the same task may be started on multiple machines, the first one to finish wins, and the other copies are killed. Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks on its own: a) the task does not report progress during the timeout (default is 10 minutes); b) the FairScheduler or CapacityScheduler needs the slot for another pool (FairScheduler) or queue (CapacityScheduler); c) speculative execution means a task's results are no longer needed because another copy has already completed elsewhere.
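For example, speculative execution can be toggled per job through the standard Hadoop 2.x configuration properties. The sketch below is illustrative only; the class name, job name, and the elided job setup are placeholders rather than part of the original question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch: toggling speculative execution per job (Hadoop 2.x property names).
public class SpeculationConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution is enabled by default; disable it when duplicate
        // task attempts would be wasteful, e.g. for tasks with external side effects.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculation-demo");  // job name is a placeholder
        // ... set mapper, reducer, input/output paths as usual ...
    }
}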
Question: You've written a MapReduce job that will process millions of input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
1. Partitioner
2. OutputFormat
4. Writable
5. Combiner
Correct Answer: 5 (Combiner)
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node, which reduces the amount of data that needs to be transferred across the network to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
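For example, a sum reducer is both commutative and associative, so the same class can also be registered as the combiner. The sketch below is illustrative; the class name and the driver wiring are not part of the original question.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: a sum reducer that can double as a combiner, shrinking the
// intermediate map output before it is shuffled to the reducers.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // summation is commutative and associative
        }
        result.set(sum);
        context.write(key, result);
    }
}

// In the driver (illustrative):
// job.setReducerClass(SumReducer.class);
// job.setCombinerClass(SumReducer.class);  // local aggregation on each mapper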
Question: You are using the MapR Hadoop framework to analyze financial data with some data-modeling algorithms. The algorithms are written in Java and packaged into a JAR file of approximately 2 MB. Which is the best way to make this library available to your MapReduce job at runtime?
1. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
2. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
4. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
What else is required of the driver class when using -libjars?
Correct Answer: Pass the JAR to the job at runtime with the -libjars generic option.
Explanation: You can use GenericOptionsParser and ToolRunner to pass your JAR directly via -libjars at runtime, when you trigger the job. A subtle but important point is also worth noting: the way to specify additional JARs for the JVMs running the distributed map and reduce tasks and for the JVM running the job client is very different.
-libjars makes the JARs available only to the JVMs running the remote map and reduce tasks.
To make the same JARs available to the client JVM (the JVM that is created when you run the hadoop jar command), you need to set the HADOOP_CLASSPATH environment variable:
$ export LIBJARS=/path/jar1,/path/jar2
$ export HADOOP_CLASSPATH=/path/jar1:/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value
Another cause of incorrect -libjars behaviour can be a wrong implementation or initialization of the custom job driver class:
The job class must implement the Tool interface, and the Configuration instance must be obtained by calling getConf() instead of creating a new instance.
The usage of the jar command is: hadoop jar <jar> [mainClass] args... If you want commons-math3.jar to be available to all the tasks, you can do either of these: 1. copy the JAR file into the $HADOOP_HOME/lib directory, or 2. use the generic option -libjars.
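A minimal sketch of such a driver is shown below, matching the com.example.MyTool class name used in the command above; the job name and the elided mapper/reducer setup are placeholders, not part of the original answer.

package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver that implements Tool, so GenericOptionsParser can strip
// the -libjars/-D options out of args before run() sees them.
public class MyTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Use getConf(), which already carries the parsed -libjars/-D settings,
        // rather than creating a fresh Configuration here.
        Job job = Job.getInstance(getConf(), "financial-model");  // job name is a placeholder
        job.setJarByClass(MyTool.class);
        // ... set mapper, reducer, input and output paths ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
}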
Question: You use Sqoop to import the EVENT table from the database, then write a Hadoop Streaming job in Python to scrub the data, and use Hive to write the new data into the Hive EVENT table. How would you automate this data pipeline?
1. Run the Sqoop job first and then handle the remaining part using MapReduce job chaining.
2. Define the Sqoop job, the MapReduce job, and the Hive job as an Oozie workflow job, and define an Oozie coordinator job to run the workflow job daily.
4. Define the Sqoop job, the MapReduce job, and the Hive job as an Oozie workflow job, and define a ZooKeeper coordinator job to run the workflow job daily.
Ans: 2
Exp: In Oozie, scheduling is the function of an Oozie coordinator job; Oozie does not allow you to schedule workflow jobs directly. Oozie coordinator jobs cannot aggregate tasks or define workflows; coordinator jobs are simply schedules of previously defined workflows. You must therefore assemble the various tasks into a single workflow job and then use a coordinator job to execute the workflow job, as in the sketch below.
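As a rough sketch (not part of the original answer), a coordinator application already deployed to HDFS can be submitted with the Oozie Java client. The Oozie server URL, the HDFS path, and the nameNode/jobTracker/user values below are assumptions for illustration only.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Sketch: submit a previously deployed coordinator application that schedules
// the Sqoop -> streaming -> Hive workflow to run daily.
public class SubmitDailyPipeline {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");  // assumed URL

        Properties conf = oozie.createConfiguration();
        // Path to the directory holding coordinator.xml (assumed location).
        conf.setProperty(OozieClient.COORDINATOR_APP_PATH,
                "hdfs:///user/etl/event-pipeline/coordinator");
        conf.setProperty("nameNode", "hdfs://namenode:8020");        // assumed
        conf.setProperty("jobTracker", "resourcemanager:8032");      // assumed
        conf.setProperty("user.name", "etl");                        // assumed

        String jobId = oozie.run(conf);  // submits and starts the coordinator
        System.out.println("Submitted coordinator job: " + jobId);
    }
}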
Question: QuickTechie Inc has a log file which is a tab-delimited text file. The file contains two columns, username and loginid. You want to use an InputFormat that returns the username as the key and the loginid as the value. Which of the following is the most appropriate InputFormat to use?