Question: The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously, and the results of whichever task finishes first are used. This is called:
1. Combine
2. IdentityMapper
4. Default Partitioner
5. Speculative Execution
Correct Answer: 5 (Speculative Execution)
Explanation: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may read its input at only 10% of the speed of the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, and that task takes much longer than all the others.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from; they trust the Hadoop platform to deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform schedules redundant copies of the remaining tasks across several nodes that do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The reducers then receive their inputs from whichever mapper completed successfully first.
Note:
* Hadoop uses "speculative execution": the same task may be started on multiple machines, the first one to finish wins, and the other copies are killed. Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks on its own: a) the task does not report progress during the timeout (default is 10 minutes); b) the FairScheduler or CapacityScheduler needs the slot for another pool (FairScheduler) or queue (CapacityScheduler); c) speculative execution means a task's results are no longer needed because another copy has already completed elsewhere.
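For example, speculative execution can be toggled per job through the standard Hadoop 2.x configuration properties. The sketch below is illustrative only; the class name, job name, and the elided job setup are placeholders rather than part of the original question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch: toggling speculative execution per job (Hadoop 2.x property names).
public class SpeculationConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution is enabled by default; disable it when duplicate
        // task attempts would be wasteful, e.g. for tasks with external side effects.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculation-demo");  // job name is a placeholder
        // ... set mapper, reducer, input/output paths as usual ...
    }
}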
Question: You've written a MapReduce job that will process millions of input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
1. Partitioner
2. OutputFormat
4. Writable
5. Combiner
Correct Answer: 5 (Combiner)
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node, which reduces the amount of data that needs to be transferred across the network to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
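For example, a sum reducer is both commutative and associative, so the same class can also be registered as the combiner. The sketch below is illustrative; the class name and the driver wiring are not part of the original question.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: a sum reducer that can double as a combiner, shrinking the
// intermediate map output before it is shuffled to the reducers.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // summation is commutative and associative
        }
        result.set(sum);
        context.write(key, result);
    }
}

// In the driver (illustrative):
// job.setReducerClass(SumReducer.class);
// job.setCombinerClass(SumReducer.class);  // local aggregation on each mapper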
Question: You are using the MapR Hadoop framework to analyze financial data with some data-modeling algorithms. The algorithms are written in Java and packaged into a JAR file of approximately 2 MB. Which is the best way to make this library available to your MapReduce job at runtime?
1. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
2. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
4. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
What else is required of the driver class when using -libjars?
Correct Answer: Pass the JAR to the job at runtime with the -libjars generic option.
Explanation: You can use GenericOptionsParser and ToolRunner to pass your JAR directly via -libjars at runtime, when you trigger the job. A subtle but important point is also worth noting: the way to specify additional JARs for the JVMs running the distributed map and reduce tasks and for the JVM running the job client is very different.
-libjars makes the JARs available only to the JVMs running the remote map and reduce tasks.
To make the same JARs available to the client JVM (the JVM that is created when you run the hadoop jar command), you need to set the HADOOP_CLASSPATH environment variable:
$ export LIBJARS=/path/jar1,/path/jar2
$ export HADOOP_CLASSPATH=/path/jar1:/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value
Another cause of incorrect -libjars behaviour can be a wrong implementation or initialization of the custom job driver class:
The job class must implement the Tool interface, and the Configuration instance must be obtained by calling getConf() instead of creating a new instance.
The usage of the jar command is: hadoop jar <jar> [mainClass] args... If you want commons-math3.jar to be available to all the tasks, you can do either of these: 1. copy the JAR file into the $HADOOP_HOME/lib directory, or 2. use the generic option -libjars.
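A minimal sketch of such a driver is shown below, matching the com.example.MyTool class name used in the command above; the job name and the elided mapper/reducer setup are placeholders, not part of the original answer.

package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver that implements Tool, so GenericOptionsParser can strip
// the -libjars/-D options out of args before run() sees them.
public class MyTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Use getConf(), which already carries the parsed -libjars/-D settings,
        // rather than creating a fresh Configuration here.
        Job job = Job.getInstance(getConf(), "financial-model");  // job name is a placeholder
        job.setJarByClass(MyTool.class);
        // ... set mapper, reducer, input and output paths ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
}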
Question: You use Sqoop to import the EVENT table from the database, then write a Hadoop Streaming job in Python to scrub the data, and use Hive to write the new data into the Hive EVENT table. How would you automate this data pipeline?
1. Run the Sqoop job first and then handle the remaining part using MapReduce job chaining.
2. Define the Sqoop job, the MapReduce job, and the Hive job as an Oozie workflow job, and define an Oozie coordinator job to run the workflow job daily.
4. Define the Sqoop job, the MapReduce job, and the Hive job as an Oozie workflow job, and define a ZooKeeper coordinator job to run the workflow job daily.
Ans: 2
Exp: In Oozie, scheduling is the function of an Oozie coordinator job; Oozie does not allow you to schedule workflow jobs directly. Oozie coordinator jobs cannot aggregate tasks or define workflows; coordinator jobs are simply schedules of previously defined workflows. You must therefore assemble the various tasks into a single workflow job and then use a coordinator job to execute the workflow job, as in the sketch below.
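As a rough sketch (not part of the original answer), a coordinator application already deployed to HDFS can be submitted with the Oozie Java client. The Oozie server URL, the HDFS path, and the nameNode/jobTracker/user values below are assumptions for illustration only.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Sketch: submit a previously deployed coordinator application that schedules
// the Sqoop -> streaming -> Hive workflow to run daily.
public class SubmitDailyPipeline {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");  // assumed URL

        Properties conf = oozie.createConfiguration();
        // Path to the directory holding coordinator.xml (assumed location).
        conf.setProperty(OozieClient.COORDINATOR_APP_PATH,
                "hdfs:///user/etl/event-pipeline/coordinator");
        conf.setProperty("nameNode", "hdfs://namenode:8020");        // assumed
        conf.setProperty("jobTracker", "resourcemanager:8032");      // assumed
        conf.setProperty("user.name", "etl");                        // assumed

        String jobId = oozie.run(conf);  // submits and starts the coordinator
        System.out.println("Submitted coordinator job: " + jobId);
    }
}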
Question: QuickTechie Inc has a log file which is a tab-delimited text file. The file contains two columns, username and loginid. You want to use an InputFormat that returns the username as the key and the loginid as the value. Which of the following is the most appropriate InputFormat to use?