Cloudera Hadoop Developer Certification Questions and Answers (Dumps and Practice Questions)

Question : You have two tables in Hive that are populated with data:
Employee
emp_id int
salary string

Employee_Detail
emp_id int
name string

You now create a new, de-normalized table and populate it with the results of joining the two tables as follows:
CREATE TABLE EMPLOYEE_FULL AS SELECT Employee_Detail.*,Employee.salary AS s
FROM Employee JOIN Employee_Detail ON (Employee.emp_id== Employee_Detail.emp_id);

You then export the table and download the file:
EXPORT TABLE EMPLOYEE_FULL TO '/hadoopexam/employee/Employee_Detail.data';

You have downloaded the file and read the file as a CSV in R. How many columns will the resulting variable in R have?

1. 1
2. 2
3. Access Mostly Uused Products by 50000+ Subscribers
4. 4
5. 5

Ans : 1
Exp : When exporting a table from Hive, the data file uses the delimiters from the table.
Because EMPLOYEE_FULL wasn't created with specific delimiters, it uses the default Hive delimiter,
which is \001 (Control-A). When the file is read into R as a CSV,
there will be only 1 column because the file isn't actually comma-delimited.
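
The effect of the delimiter mismatch can be reproduced outside Hive and R. The following is a minimal Python sketch, assuming a single exported row whose values (101, Alice, 5000) are made up for illustration; it only mimics the file layout and does not call Hive.

```python
import csv
import io

# Hive's default field delimiter is Control-A (\x01).
HIVE_DELIMITER = "\x01"

# One hypothetical exported row: emp_id, name, salary.
row = HIVE_DELIMITER.join(["101", "Alice", "5000"])
exported_file = io.StringIO(row + "\n")

# Parsed as comma-separated text, the whole line is one field.
as_csv = next(csv.reader(exported_file))

# Split on the real delimiter, the three columns come back.
as_ctrl_a = row.split(HIVE_DELIMITER)

print(len(as_csv), len(as_ctrl_a))
```

A CSV reader finds no commas, so it returns one field per line; telling the reader about the Control-A separator recovers the individual columns.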

Question : You use Sqoop to import a table from your RDBMS into HDFS.
You know that Sqoop typically instantiates four Mappers. However, after the table import,
you notice that five Mappers have run, there are five output files in HDFS, and one of the output files is empty. Why?

1. The administrator has set the sqoop.num.maps property on the slave nodes to 7
2. Some Map tasks failed and had to be rerun
3. Access Mostly Uused Products by 50000+ Subscribers
4. The HDFS block size was set to a very small value, resulting in more Mappers than usual running
5. The table was modified by a user of the RDBMS as Sqoop was running

Ans : 3
Exp : If some Map task attempts failed, they would be rerun but no data from the failed task attempts would be stored on disk. There is no sqoop.num.maps property. Sqoop typically reads the table in a single transaction, so modifying the data would have no effect; and the HDFS block size is irrelevant to the number of files created. The correct answer is that by default, Sqoop uses the table's primary key to determine how to split the data. If there is no numeric primary key, Sqoop will make a best-guess attempt at how the data is distributed, and may run more than its default four Mappers, although some may end up not actually reading any data.
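
Why a Mapper can end up with no data is easier to see with numbers. The sketch below is a simplified model of range-based splitting, not Sqoop's actual splitter code: it assumes the split column's minimum and maximum are cut into equal-width ranges, and the key values are hypothetical.

```python
def range_splits(lo, hi, num_splits):
    """Cut the closed key range [lo, hi] into num_splits half-open ranges."""
    step = (hi - lo + 1) / num_splits
    bounds = [lo + round(i * step) for i in range(num_splits)] + [hi + 1]
    return [(bounds[i], bounds[i + 1]) for i in range(num_splits)]

def rows_per_split(keys, splits):
    """Count how many keys fall into each [start, end) range."""
    return [sum(1 for k in keys if start <= k < end) for start, end in splits]

keys = [1, 2, 3, 4, 5, 1000]  # hypothetical, heavily skewed key values
splits = range_splits(min(keys), max(keys), 4)
counts = rows_per_split(keys, splits)
print(splits, counts)
```

With skewed keys, some ranges contain no rows at all, so the Mappers assigned to them write empty output files even though they ran.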

Question : Using Apache Sqoop you can import the data to

1. Apache Hive
2. Apache HBase
3. Access Mostly Uused Products by 50000+ Subscribers
4. Only 1 and 3
5. All 1,2 and 3

Ans : 5
Exp : Apache Sqoop can be used to import data from any relational DB into HDFS, Hive or HBase.
To import data into HDFS, use the sqoop import command and specify the relational DB table and connection parameters:

sqoop import --connect "JDBC connection string" --table "tablename" --username "username" --password "password"

This will import the data and store it as CSV files in a directory in HDFS. To import data into Hive, use the sqoop import command and specify the option 'hive-import':

sqoop import --connect "JDBC connection string" --table "tablename" --username "username" --password "password" --hive-import

This will import the data into a Hive table with the appropriate data types for each column.

Question : You decide to use Hive to process data in HDFS. You have not created any Hive tables until now.
Hive is configured with its default settings. You run the following commands from the Hive shell:

CREATE DATABASE HADOOPEXAM;
USE HADOOPEXAM;
CREATE TABLE HE_EMPLOYEE(EMP_NAME STRING, EMP_ID INT, EMP_SALARY INT);

In which directory should you now store data so that it can be accessed as table HE_EMPLOYEE in database HADOOPEXAM?

1. /user/hive/warehouse/HADOOPEXAM/HE_EMPLOYEE
2. /user/hive/warehouse/HADOOPEXAM.db/HE_EMPLOYEE
3. Access Mostly Uused Products by 50000+ Subscribers
4. /user/hive/HADOOPEXAM.db/HE_EMPLOYEE
5. /user/hive/warehouse/HADOOPEXAM HE_EMPLOYEE

Ans : 2
Exp : When you create a database named HADOOPEXAM in Hive, that creates a subdirectory of Hive's warehouse directory named HADOOPEXAM.db.
All tables are placed in subdirectories of HADOOPEXAM.db; those subdirectory names are the names of the tables.
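
The naming rule can be written down as a small helper. This is a sketch only: table_dir is a hypothetical function, and the warehouse root is assumed to be the default /user/hive/warehouse. Hive generally stores identifiers in lowercase, so the directory names on a real cluster may appear lowercased.

```python
import posixpath

def table_dir(database, table, warehouse="/user/hive/warehouse"):
    """Default HDFS directory for a managed table in a non-default database."""
    # The database becomes a "<name>.db" directory; the table is a subdirectory.
    return posixpath.join(warehouse, database + ".db", table)

print(table_dir("HADOOPEXAM", "HE_EMPLOYEE"))
```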

Question : For HadoopExam.com user profiles you need to analyze roughly ,, JPEG files of all the.
Each file is no more than 3 kB. Because your Hadoop cluster isn't optimized for storing and processing many small files,
you decide to group the files into a single archive. The toolkit that will be used to process
the files is written in Ruby and requires that it be run with administrator privileges.
Which of the following file formats should you select to build your archive?

1. TIFF
2. SequenceFiles
3. Access Mostly Uused Products by 50000+ Subscribers
4. MPEG
5. Avro

Ans : 5

Exp : The two formats that are best suited to merging small files into larger archives for processing in Hadoop are Avro and SequenceFiles. Avro has Ruby bindings; SequenceFiles are only supported in Java.

JSON, TIFF, and MPEG are not appropriate formats for archives. JSON is also not an appropriate format for image data.

Question : SequenceFiles are flat files consisting of binary key/value pairs. SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.
There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs.
You have created a SequenceFile (MAIN.PROFILE.log) with custom key and value types. What command displays the contents of a
SequenceFile named MAIN.PROFILE.log in your terminal in human-readable format?

1. hadoop fs -decrypt MAIN.PROFILE.log
2. hadoop fs -text MAIN.PROFILE.log
3. Access Mostly Uused Products by 50000+ Subscribers
4. hadoop fs -encode MAIN.PROFILE.log

Correct Answer : 2

Explanation: SequenceFiles are flat files consisting of binary key/value pairs. SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively. There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:
Writer : Uncompressed records.
RecordCompressWriter : Record-compressed files, only compress values.
BlockCompressWriter : Block-compressed files, both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec. The recommended way is to use the static createWriter methods provided by the SequenceFile to choose the preferred format. The SequenceFile.Reader acts as the bridge and can read any of the above SequenceFile formats.

SequenceFile Formats
Essentially there are 3 different formats for SequenceFiles depending on the CompressionType specified. All of them share a common header described below.

SequenceFile Header
version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
keyClassName - key class
valueClassName - value class
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
metadata - SequenceFile.Metadata for this file.
sync - A sync marker to denote end of the header.

Uncompressed SequenceFile Format
Header
Record : Record length, Key length, Key, Value
A sync-marker every few 100 bytes or so.

A SequenceFile contains the names of the classes used for the key and value as part of its header. hadoop fs -text reads the records, and calls the toString() method of the relevant class to display human-readable output on the console. The hadoop fs -cat command would display the raw data from the file, which is not human-readable. hadoop fs -get retrieves the file from HDFS and places it on the local disk, which is not what was required. The other options are syntactically incorrect.
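
The first four bytes of the header described above are enough to recognise a SequenceFile. The following Python sketch checks only the magic and the version byte; it assumes the version is stored as a raw byte value (6 rather than the character '6'), and read_magic and the sample bytes are hypothetical.

```python
def read_magic(header: bytes):
    """Return the SequenceFile version byte, or None if the magic is wrong."""
    if len(header) < 4 or header[:3] != b"SEQ":
        return None
    # Indexing a bytes object yields the byte as an int.
    return header[3]

# Hypothetical start of a version-6 file; the rest of the header is omitted.
sample = b"SEQ\x06" + b"rest-of-header"
print(read_magic(sample))
```

Parsing the class names and metadata that follow needs Hadoop's own serialization rules, which is why hadoop fs -text is the practical way to inspect such a file.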

Question : Speculative execution is an optimization technique where a computer system performs
some task that may not be actually needed. The main idea is to do work before it is known whether that work will be needed at all,
so as to prevent a delay that would have to be incurred by doing the work after it is known whether it is needed. If it turns out the work was not needed
after all, any changes made by the work are reverted and the results are ignored. Consider an ETL MapReduce job that uses Mappers to process data
and Reducers with DBMSOutputFormat to push the results directly to an Oracle database. Select the correct statement that applies to
speculative execution.

1. Disable speculative execution for the data insert job
2. Enable speculative execution for the data insert job
3. Access Mostly Uused Products by 50000+ Subscribers
4. Configure only single mapper for the data insert job

Correct Answer : 1

Explanation: I usually disable speculative execution for MapReduce tasks when I write to an RDBMS from a Hive user-defined table function.

set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;

If you also tune mapred.reduce.tasks, you can control the number of concurrent RDBMS sessions. It is also good to use batch mode and control the commit.

If we do not disable speculative execution, it is possible that multiple instances of a given Reducer could run, which would result in more data than was intended being inserted into the target RDBMS. None of the other options presented is required: the database driver is needed on the client machine only if you plan to connect to the RDBMS from that client. It is certainly not necessary for yours to be the only job running on the cluster, and the values of dfs.datanode.failed.volumes.tolerated and the block size of the input data are irrelevant. Finally, the RDBMS does not need to allow passwordless login.
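
The duplicate-insert risk can be shown with a toy model. In this Python sketch a plain list stands in for the target table and run_reducer_attempt is a hypothetical stand-in for one Reducer attempt writing directly to the database; on a real cluster a speculative attempt may be killed partway through, so the number of duplicated rows would vary.

```python
def run_reducer_attempt(records, table):
    """One reducer attempt: insert every record straight into the 'database'."""
    for record in records:
        table.append(record)

records = [("emp-1", 100), ("emp-2", 200)]  # hypothetical reducer output

# Speculation disabled: exactly one attempt runs.
table_single = []
run_reducer_attempt(records, table_single)

# Speculation enabled: a second attempt of the same reducer also runs.
table_speculative = []
run_reducer_attempt(records, table_speculative)
run_reducer_attempt(records, table_speculative)

print(len(table_single), len(table_speculative))
```

File-based output avoids this because only one attempt's output is committed, whereas rows already inserted into an external database are not rolled back by the framework.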

Question : Apache MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs.
The MRUnit testing framework is based on JUnit and can test MapReduce programs written for Hadoop versions 0.20, 0.23.x, 1.0.x and 2.x.
You have a Reducer which simply sums up the values for any given key. You write a unit test in MRUnit to test the Reducer, with this code:
@Test
public void testETLReducer() {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    List<IntWritable> values2 = new ArrayList<IntWritable>();
    values2.add(new IntWritable(1));
    values2.add(new IntWritable(1));
    reduceDriver.withInput(new LongWritable("5673"), values);
    reduceDriver.withInput(new LongWritable("109098"), values2);
    reduceDriver.withOutput(new LongWritable("109098"), new IntWritable(2));
    reduceDriver.runTest();
}

What is the result?

1. The test will pass with warning and error
2. The test will pass with no warning and error
3. Access Mostly Uused Products by 50000+ Subscribers
4. Code will not compile

Example :
@Test
public void testMapReduce() {
    mapReduceDriver.withInput(new LongWritable(), new Text(
        "655209;1;796764372490213;804422938115889;6"));
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    mapReduceDriver.withOutput(new Text("6"), new IntWritable(2));
    mapReduceDriver.runTest();
}

MRUnit supports two styles of testing. In the first, you tell the framework both the input and the expected output values and let the framework do the assertions; the second is the more traditional approach where you do the assertions yourself. The example above uses the first style.

When testing a Reducer using MRUnit, you should only pass the Reducer a single key and list of values. In this case, we use the withInput() method twice, but only the second call will actually be used; the first will be overridden by the second. If you want to test the Reducer with two inputs, you would have to write two tests.

Testing a Hadoop job without such a library requires a lot of effort not related to the job. You must configure it to run locally, create a sample input file, run the job on your sample input, and then compare the result to an expected output file. This not only takes time, but also makes your tests run very slowly due to all the file I/O. MRUnit is a unit test library designed to facilitate easy integration between your MapReduce development process and standard development and testing tools such as JUnit. With MRUnit, there are no test files to create, no configuration parameters to change, and generally less test code. You can cut the clutter and focus on the meat of your tests.


Related Questions

Question : In the word count MapReduce algorithm, why might using a combiner (which runs after the Mapper and before the Reducer)
reduce the overall job running time?

1. combiners perform local filtering of repeated words, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
2. combiners perform global aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
3. Access Mostly Uused Products by 50000+ Subscribers
4. combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
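
What a combiner does to the shuffle volume can be sketched in a few lines of Python. This is a toy model, not Hadoop code: each input line stands in for one Mapper's input, and map_words, combine and reduce_counts are hypothetical helper names.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum the counts per word within one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_counts(pairs):
    """Reducer: sum the counts per word across all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapper_inputs = ["to be or not to be", "to do or not to do"]
without_combiner = [p for line in mapper_inputs for p in map_words(line)]
with_combiner = [p for line in mapper_inputs for p in combine(map_words(line))]

print(len(without_combiner), len(with_combiner))
```

Both lists reduce to the same word counts, but the combined one holds fewer pairs, which is the data that would otherwise cross the network during the shuffle.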

Question : The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat's
logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your
program (lines are not missed or broken, for example), but it's worth knowing about, as it does mean that data-local maps (that is,
maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not
normally significant. Assume you are using the latest version of Hadoop provided by Cloudera, which also includes MR2.
You submit a job to process a single www.HadoopExam.com log file, which is made up of two blocks, named BLOCKX and BLOCKY.
BLOCKX is on nodeA, and is being processed by a Mapper running on that node. BLOCKY is on nodeB.
A record spans the two blocks; that is, the first part of the record is in BLOCKX,
but the end of the record is in BLOCKY. What happens as the record is being read by the Mapper on nodeA?

1. The remaining part of the record is streamed across the network from either nodeA or nodeB
2. The remaining part of the record is streamed across the network from nodeA
3. Access Mostly Uused Products by 50000+ Subscribers
4. The remaining part of the record is streamed across the network from nodeB
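
How a line-oriented reader avoids losing or duplicating a record that straddles a boundary can be modelled as follows. This Python sketch is a simplification and not Hadoop's record reader: read_split is a hypothetical function, the byte string stands in for the log file, and the 13-byte boundary stands in for the block boundary.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the lines owned by the split covering bytes [start, end)."""
    pos = start
    if start > 0:
        # A line that began in the previous split belongs to that split's
        # reader, so jump to the first line that starts at or after `start`.
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:stop])  # may extend past `end` into the next block
        pos = stop
    return lines

data = b"alpha\nbravo-charlie\ndelta\n"
block_x = read_split(data, 0, 13)   # first 13 bytes, standing in for BLOCKX
block_y = read_split(data, 13, 26)  # remaining bytes, standing in for BLOCKY

print(block_x, block_y)
```

The reader for the first split keeps reading past its own end to finish the record it started, and the reader for the second split skips that partial record; in a cluster the bytes beyond the boundary live in the other block, so they are fetched from the node that stores it.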

Question : If you run the word count MapReduce program with m map tasks and r reduce tasks,
how many output files will you get at the end of the job, and how many key-value pairs will there be in each file?
Assume k is the number of unique words in the input files. (The word count program reads
text input and produces output that contains every distinct word and the number of times that word occurred anywhere in the text.)

1. There will be r files, each with approximately m/r key-value pairs.
2. There will be m files, each with approximately k/r key-value pairs.
3. Access Mostly Uused Products by 50000+ Subscribers
4. There will be r files, each with approximately k/m key-value pairs.
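
The relationship between r, k and the output files can be checked with a small simulation. In this Python sketch each reduce task is modelled as one list (one output file); the crc32-based partition function is a stable stand-in chosen for the example, not Hadoop's own hash, and the 1000 words are generated placeholders.

```python
import zlib

def partition(word, num_reducers):
    """Assign a key to a reducer; crc32 is an assumed stand-in hash."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def output_files(unique_words, num_reducers):
    """One list per reducer, holding the keys that reducer would write."""
    files = [[] for _ in range(num_reducers)]
    for word in unique_words:
        files[partition(word, num_reducers)].append(word)
    return files

words = ["word%d" % i for i in range(1000)]  # k = 1000 hypothetical unique words
files = output_files(words, 4)               # r = 4 reduce tasks

print([len(f) for f in files])
```

Each unique word lands in exactly one file, so the file sizes add up to k and, with a reasonably uniform hash, each file holds roughly k/r entries; the number of map tasks does not appear anywhere in the calculation.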

Question : You are processing the MAIN.PROFILE.log file, generated by the Apache web server of the QuickTechie.com website, with a MapReduce job.
There are 100 nodes in the cluster and 3 reducers defined. Which of the reduce tasks will process a Text key which begins with the regular expression "\w+"?

1. The first Reducer will process the key, which satisfies the regular expression "\w+"
2. The second Reducer will process the key, which satisfies the regular expression "\w+"
3. Access Mostly Uused Products by 50000+ Subscribers
4. Not enough data to determine which reduce task will receive which key

Question : To process the www.HadoopExam.com MAIN.PROFILE.log file, you submit a job to a cluster running on MRv.
There are 1000 slave nodes in 100 racks. You have NOT specified a rack topology script. Your job has a single Reducer which runs on Node7 of Rack7.
The output file it writes is small enough to fit in a single HDFS block. How does Hadoop handle writing the output file?

1. The first replica of the block will be stored in any node out of 1000 nodes.
2. The first replica of the block will be stored on node7 of Rack7 only. The other two replicas will be stored on other nodes in any rack.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The first replica of the block will be stored on node7 in rack7. The other two replicas will be stored on node6 and node8 in rack7

Question :

Let's assume you have the following files in the HDFS directory called merge.
Test1.txt
hadoopexam.com Hadoop Training 1

Test2.txt
www.hadoopexam.com Hadoop YARN Training

Test3.txt
http://hadoopexam.com Amazon WebService Training

Now you run the following command
hadoop fs -getmerge merge/ output1.txt
What is the correct statement?

1. It will create a new file called output1.txt in the local file system, with the merged content from all three files
2. It will create a new file called output1.txt in the HDFS file system, with the merged content from all three files
3. Access Mostly Uused Products by 50000+ Subscribers
4. This command will succeed but will not merge the files, because how to handle the newline character is not defined.
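
The concatenation that hadoop fs -getmerge performs can be imitated on a local directory. The Python sketch below is only a local stand-in: get_merge is a hypothetical function, a temporary directory plays the role of the HDFS merge directory, and processing the files in name order is an assumption.

```python
import os
import tempfile

def get_merge(src_dir, dst_file):
    """Concatenate every file in src_dir, in name order, into dst_file."""
    with open(dst_file, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())

contents = {
    "Test1.txt": "hadoopexam.com Hadoop Training 1\n",
    "Test2.txt": "www.hadoopexam.com Hadoop YARN Training\n",
    "Test3.txt": "http://hadoopexam.com Amazon WebService Training\n",
}

with tempfile.TemporaryDirectory() as work:
    merge_dir = os.path.join(work, "merge")
    os.mkdir(merge_dir)
    for name, text in contents.items():
        with open(os.path.join(merge_dir, name), "w") as f:
            f.write(text)
    target = os.path.join(work, "output1.txt")
    get_merge(merge_dir, target)
    with open(target) as f:
        merged = f.read()

print(merged)
```

The real command reads its sources from HDFS and writes the single merged file to the local file system.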