Cloudera Hadoop Developer Certification Questions and Answers (Dumps and Practice Questions)

Question :

Which of the following statements most accurately describes the relationship between MapReduce and Pig?

1. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
2. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
4. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.


Question :Which of the following best describes the workings of TextInputFormat?

1. Input file splits may cross line breaks. A line that crosses file splits is ignored.
2. The input file is split exactly at the line breaks, so each Record Reader will read a series of complete lines.
4. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

Explanation: FileInputFormat computes the splits as follows. For each input file, it gets the file length and the block size, and calculates the split size as max(minSize, min(maxSize, blockSize)), where maxSize corresponds to mapred.max.split.size and minSize to mapred.min.split.size.
It then divides the file into FileSplits based on that split size. What's important here is that each FileSplit is initialized with a start parameter corresponding to its offset in the input file. Lines are not handled at all at this point. The relevant part of the code looks like this:
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                 blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
    }
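As a rough illustration of the split-size formula and the loop above, here is a minimal, self-contained sketch that does not depend on Hadoop. The class name SplitPlanner and its methods are hypothetical. The value 1.1 for SPLIT_SLOP and the handling of the leftover bytes after the loop are assumptions based on how Hadoop's FileInputFormat is commonly described; the excerpt above does not show them.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanner {
    // Assumed slop factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Split size as described in the text: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {start, length} pairs in bytes for a single file. */
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // Assumption: whatever is left becomes the final, possibly short, split.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        for (long[] s : planSplits(200 * mb, splitSize)) {
            System.out.println("start=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Note that the splits carry only byte offsets and lengths; nothing in this step looks at line boundaries.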
After that, if you look at the LineRecordReader, which is the record reader defined by TextInputFormat, that's where the lines are handled.
When you initialize your LineRecordReader, it tries to instantiate a LineReader, which is an abstraction for reading lines from an FSDataInputStream. There are 2 cases:
If there is a CompressionCodec defined, then this codec is responsible for handling boundaries. That case is probably not relevant to the question.
If there is no codec, however, that's where things get interesting: if the start of your InputSplit is not 0, then you backtrack 1 character and skip the first line you encounter, identified by \n or \r\n (Windows). The backtrack is important because, if your line boundaries coincide with split boundaries, it ensures you do not skip a valid line. Here is the relevant code:
    if (codec != null) {
        in = new LineReader(codec.createInputStream(fileIn), job);
        end = Long.MAX_VALUE;
    } else {
        if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {
        // skip first line and re-establish "start".
        start += in.readLine(new Text(), 0,
                             (int) Math.min((long) Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
So since the splits are calculated in the client, the mappers don't need to run in sequence; every mapper already knows whether it needs to discard the first line or not.
So basically, if you have 2 lines of 100MB each in the same file and, to simplify, let's say the split size is 64MB, then when the input splits are calculated we will have the following scenario:
Split 1 containing the path and the hosts to this block. Initialized at start 200-200=0MB, length 64MB.
Split 2 initialized at start 200-200+64=64MB, length 64MB.
Split 3 initialized at start 200-200+128=128MB, length 64MB.
Split 4 initialized at start 200-200+192=192MB, length 8MB.
Mapper A will process split 1. Start is 0, so it doesn't skip the first line, and it reads a full line, which goes beyond the 64MB limit and so needs a remote read.
Mapper B will process split 2. Start is != 0, so it skips the first line found after 64MB-1 byte, which ends with the end of line 1 at 100MB, still inside split 2. It then reads line 2: 28MB of that line are in split 2, so it remote-reads the remaining 72MB.
Mapper C will process split 3. Start is != 0, so it skips the first line found after 128MB-1 byte, which ends with the end of line 2 at 200MB. That is the end of the file, so it doesn't do anything.
Mapper D is the same as mapper C, except that it looks for a newline after 192MB-1 byte.
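The scenario above can be replayed at a smaller scale with an in-memory model. The sketch below is a hypothetical simulation, not Hadoop code: SplitLineReader, readSplit, skipLine and twoLongLines are made-up names, only \n line endings are handled, and the loop condition (keep reading while the position is before the split end) is an assumption modeled on the LineRecordReader version quoted above. It uses two 100-byte lines and 64-byte splits in place of 100MB and 64MB.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitLineReader {
    /** Lines that a reader assigned to [start, start + length) would emit. */
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // Back up one byte, then discard everything up to and including the next '\n'.
            pos = skipLine(data, (int) start - 1);
        }
        List<String> lines = new ArrayList<>();
        // A line is emitted if it begins before the split end, even if it finishes beyond it.
        while (pos < end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the next '\n' at or after 'from', or data.length if there is none. */
    static int skipLine(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Two lines of 100 bytes each (99 characters plus '\n'), 200 bytes in total. */
    static byte[] twoLongLines() {
        byte[] data = new byte[200];
        Arrays.fill(data, 0, 99, (byte) 'a');
        data[99] = '\n';
        Arrays.fill(data, 100, 199, (byte) 'b');
        data[199] = '\n';
        return data;
    }

    public static void main(String[] args) {
        byte[] data = twoLongLines();
        long[][] splits = {{0, 64}, {64, 64}, {128, 64}, {192, 8}};
        for (long[] s : splits) {
            System.out.println("split at " + s[0] + ": "
                    + readSplit(data, s[0], s[1]).size() + " line(s)");
        }
    }
}
```

In this model the first two splits yield one line each and the last two yield none, so every line is read exactly once, by the reader whose split contains the beginning of the line.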

Question :

In a MapReduce job, you want each of your input files processed by a single map task.
How do you configure a MapReduce job so that a single map task processes each input
file regardless of how many blocks the input file occupies?

1. Increase the parameter that controls minimum split size in the job configuration.
2. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
4. Write a custom FileInputFormat and override the method isSplitable to always return false.


Explanation: When isSplitable returns false, only a single mapper processes the entire file. The mapper can emit any number of key-value pairs.
Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
The API documentation describes the method as follows:
Is the given filename splitable? Usually true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.
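To see the effect of that override without a Hadoop installation, here is a minimal stand-in model. All class names are hypothetical and none of them are Hadoop classes; the real method takes Hadoop types such as (FileSystem, Path) rather than a String, and the split loop is simplified (no slop factor, no block locations).

```java
import java.util.ArrayList;
import java.util.List;

final class SplitableModel {
    /** Hypothetical stand-in for an input format's split planning. */
    static class ModelInputFormat {
        boolean isSplitable(String fileName) {
            return true; // default: files may be split
        }

        /** Returns {start, length} pairs; each pair would be handled by one map task. */
        final List<long[]> getSplits(String fileName, long length, long splitSize) {
            List<long[]> splits = new ArrayList<>();
            if (!isSplitable(fileName)) {
                splits.add(new long[] {0, length}); // whole file goes to one mapper
                return splits;
            }
            for (long start = 0; start < length; start += splitSize) {
                splits.add(new long[] {start, Math.min(splitSize, length - start)});
            }
            return splits;
        }
    }

    /** The kind of override the explanation describes: never split any file. */
    static class WholeFileInputFormat extends ModelInputFormat {
        @Override
        boolean isSplitable(String fileName) {
            return false;
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(new ModelInputFormat().getSplits("big.log", 200 * mb, 64 * mb).size());
        System.out.println(new WholeFileInputFormat().getSplits("big.log", 200 * mb, 64 * mb).size());
    }
}
```

With a 200MB file and a 64MB split size, the default model plans four splits while the overriding subclass plans a single split covering the whole file, regardless of how many blocks the file occupies.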

Related Questions

Question :

What is the result of the following command (the database username is foo and password is bar)?
$ sqoop list-tables --connect jdbc:mysql://localhost/databasename --table --username foo --password bar

1. sqoop lists only those tables in the specified MySQL database that have not already been imported into HDFS
2. sqoop returns an error
4. sqoop imports all the tables from SQL to HDFS

Question :

Which best describes the primary function of Flume?

1. Flume is a platform for analyzing large data sets that consists of a high level language for
expressing data analysis programs, coupled with an infrastructure consisting of sources and sinks
for importing and evaluating large data sets
2. Flume acts as a Hadoop filesystem for log files
4. Flume provides a query language for Hadoop similar to SQL
5. Flume is a distributed server for collecting and moving large amounts of data into HDFS as it's
produced from streaming data flows

Question :

You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25
KB. Because your Hadoop cluster isn't optimized for storing and processing many small files you
decide to do the following actions:
1. Group the individual images into a set of larger files
2. Use the set of larger files as input for a MapReduce job that processes them directly with
Python using Hadoop streaming
Which data serialization system gives you the flexibility to do this?

A. CSV
B. XML
C. HTML
D. Avro
E. Sequence Files
F. JSON

1. A,B
2. C,D
4. D,E
5. C,E

Question :

You have user profile records in an OLTP database that you want to join with web server logs
which you have already ingested into HDFS. What is the best way to acquire the user profile for
use in HDFS?
A. Ingest with Hadoop streaming
B. Ingest with Apache Flume
C. Ingest using Hive's LOAD DATA command
D. Ingest using Sqoop
E. Ingest using Pig's LOAD command

1. A,B
2. C,D
4. D,E
5. A,E

Question : Match the following YARN components to their roles

1. YARN Resource Manager
2. YARN Node Managers
3. Access Mostly Uused Products by 50000+ Subscribers

a. which launch and monitor the tasks of jobs
b. allocates the cluster resources to jobs
c. which coordinates the tasks running in the MapReduce job

1. 1-a, 2-b,3-c
2. 1-b, 2-a,3-c
4. 1-a, 2-c,3-b

Question : A developer has submitted a YARN job by calling the submitApplication() method on the Resource Manager.
Please select the correct order of the steps below that follow the submission

1. Container will be managed by Node Manager after job submission
2. Resource Manager triggers its sub-component Scheduler, which allocates containers for mapreduce job execution.
3. Access Mostly Uused Products by 50000+ Subscribers

1. 2,3,1
2. 1,2,3
4. 1,3,2