Question : To process the www.HadoopExam.com MAIN.PROFILE.log file, you submit a job to a cluster running MRv1. There are 1,000 slave nodes spread across 100 racks, and you have NOT specified a rack topology script. Your job has a single Reducer, which runs on Node7 of Rack7. The output file it writes is small enough to fit in a single HDFS block. How does Hadoop handle writing the output file?
1. The first replica of the block will be stored on any node out of the 1,000 nodes.
2. The first replica of the block will be stored on Node7 of Rack7 only. The other two replicas will be stored on other nodes in any rack.
3.
4. The first replica of the block will be stored on Node7 in Rack7. The other two replicas will be stored on Node6 and Node8 in Rack7.
Explanation: How is it decided on which datanodes the replicas of a block will be kept?
There is a trade-off between reliability and read/write bandwidth here. Suppose we set the replication factor to 1. In that case every block of the data is kept exactly once, on a single machine, and if that machine goes down we cannot retrieve the data back, so we have a reliability problem. On the other hand, placing all replicas on a single node incurs the lowest write-bandwidth penalty, which is good.
Now suppose we set the replication factor to 5. In that case the data blocks remain readable even if up to four of the nodes holding replicas go down, and while reading there is a greater chance of data locality, i.e. a replica sitting close to the client, so data retrieval is fast. But the write bandwidth consumed is higher, and so is the storage overhead from the extra redundancy.
As we can see, there is a trade-off between reliability and read/write bandwidth. In roughly 95% of cases the replication factor is kept at 3, which is suitable for most production use cases.
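As a minimal sketch of how this knob is exposed, the snippet below sets the client-side default through dfs.replication and then changes the replication factor of a single file with the HDFS FileSystem API. The output path is hypothetical and assumes a reachable HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default for files created by this client; 3 is the usual production choice.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output path, for illustration only.
        Path out = new Path("/user/hadoopexam/MAIN.PROFILE.out");
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.writeUTF("reducer output");
        }

        // Replication can also be changed per file after the fact:
        // higher reliability at the cost of write bandwidth and storage.
        fs.setReplication(out, (short) 5);
    }
}
```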
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack. In other words, the block placement policy is such that the first copy of any block created by a client running on the cluster (as the Reducer is in this case) will be written on the node running that client, and the other two replicas will be stored on other nodes in the cluster. Watch the training from http://hadoopexam.com/index.html/#hadoop-training
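To see where the replicas of the Reducer's output actually landed, you can ask the NameNode for the file's block locations. A minimal sketch, assuming a hypothetical part-r-00000 output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical reducer output path, for illustration only.
        Path out = new Path("/user/hadoopexam/output/part-r-00000");
        FileStatus status = fs.getFileStatus(out);

        // For a file that fits in one block this returns a single BlockLocation,
        // listing the datanodes holding replicas. Without a topology script,
        // every topology path shows the same /default-rack.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("hosts: " + String.join(", ", block.getHosts()));
            System.out.println("racks: " + String.join(", ", block.getTopologyPaths()));
        }
    }
}
```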
Question : In which of the following scenarios should we use HBase?
1. If it requires random reads, random writes, or both
2. If it requires many thousands of operations per second on multiple TB of data
3.
4. All of the above
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data. HBase's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. If you know the access pattern in advance, you can put all the data that is used together into a single column family, so that access to it becomes faster.
Refer to HadoopExam.com Recorded Training, Module 18.
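The sketch below shows the random read/write pattern HBase is built for, using the standard HBase Java client. The profile table name and the d column family are hypothetical, and the table is assumed to already exist with that family.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("profile"))) {

            // One column family holding the data that is read together.
            byte[] family = Bytes.toBytes("d");

            // Random write: keyed by row, routed to exactly one region server.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(family, Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random realtime read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user#42")));
            System.out.println(Bytes.toString(result.getValue(family, Bytes.toBytes("city"))));
        }
    }
}
```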
Question : In which scenarios should HBase NOT be used?
Explanation: First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (secondary indexes, transactions, a rich query language, and so on), and keep in mind that ad-hoc analytical queries will typically run slower on HBase.