Explanation: public static int frequency(Collection<?> c, Object o)
Returns the number of elements in the specified collection equal to the specified object. More formally, returns the number of elements e in the collection such that (o == null ? e == null : o.equals(e)).
Parameters:
c - the collection in which to determine the frequency of o
o - the object whose frequency is to be determined
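As a quick illustration (a minimal sketch; the list contents and class name below are made up for the example), Collections.frequency can be used like this:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FrequencyExample {
    public static void main(String[] args) {
        // Example data chosen only for illustration
        List<String> names = Arrays.asList("alice", "bob", "alice", "carol", "alice");
        // Counts the elements e for which "alice".equals(e) is true
        int count = Collections.frequency(names, "alice");
        System.out.println(count); // prints 3
    }
}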
Question : Select the correct statement about reading/writing data in an RDBMS using MapReduce
1. In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with
2. The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer
3. Access Mostly Uused Products by 50000+ Subscribers
4. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import
5. All of the above
Correct Answer :

Explanation: First we'll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. Links to popular drivers are listed in the resources section at the end of this post. To start using DBInputFormat to connect to your database, you'll need to download the appropriate database driver from the list in the resources section (see the end of this post), and drop it into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines and on the machine from which you launch your jobs.

Reading Tables with DBInputFormat
DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop's formalization of a data source; it can mean files formatted in a particular way, data read from a database, and so on. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Most queries are supported, subject to a few limitations discussed at the end of this article.

Writing results back to the database
A companion class, DBOutputFormat, allows you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before. The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its three arguments are the JobConf object for the job, a string defining the name of the table to write to, and an array of strings defining the fields of the table to populate, e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");. The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.

Limitations of the InputFormat
JDBC allows applications to generate SQL queries which are executed against the database; the results are then returned to the calling application. Keep in mind that you will be interacting with your database via repeated SQL queries. Therefore:
- Hadoop may need to execute the same query multiple times, and it will need to return the same results each time. So any concurrent updates to your database should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce job's query via a clause such as "insert_date < yesterday", or dumping the data to a temporary table in the database before launching your MapReduce process.
- In order to parallelize the processing of records from the database, Hadoop will execute SQL queries that use ORDER BY, LIMIT, and OFFSET clauses to select ranges out of tables. Your results, therefore, need to be orderable by one or more keys (either PRIMARY, like the one in the example, or UNIQUE).
- In order to set the number of map tasks, the DBInputFormat needs to know how many records it will read.
So if you're writing an arbitrary SQL query against the database, you will need to provide a second query that returns the number of rows that the first query will return (e.g., by using COUNT and GROUP BY).

With these restrictions in mind, there's still a great deal of flexibility available to you. You can bulk load entire tables into HDFS, or select large ranges of data. For example, if you want to read records from a table that is also being populated by another source concurrently, you might set up that table to attach a timestamp field to each record. Before doing the bulk read, pick the current timestamp, then select all records with timestamps earlier than that one. New records being fed in by the other writer will have later timestamps and will not affect the MapReduce job.

Finally, be careful to understand the bottlenecks in your data processing pipeline. Launching a MapReduce job with 100 mappers performing queries against a database server may overload the server or its network connection. In this case, you'll achieve less parallelism than theoretically possible, due to starvation, disk seeks, and other performance penalties.

Limitations of the OutputFormat
The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer's close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
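To tie the pieces above together, here is a minimal sketch of a DBWritable record class and the corresponding job setup, using the old org.apache.hadoop.mapred API (JobConf) that the explanation refers to. The employees table, its employee_id and name columns, and the JDBC driver class, URL, and credentials are assumptions carried over from the example or invented for illustration; mapper and reducer wiring is omitted.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class EmployeeDbJob {

    // Deserializes one row of the (assumed) employees table into plain fields,
    // and serializes it back into the parameters of an INSERT statement.
    public static class EmployeeRecord implements Writable, DBWritable {
        long employeeId;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            // Column order matches the field names passed to setInput()/setOutput()
            employeeId = rs.getLong(1);
            name = rs.getString(2);
        }

        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setLong(1, employeeId);
            stmt.setString(2, name);
        }

        // Writable methods let the record travel between map and reduce tasks
        public void readFields(DataInput in) throws IOException {
            employeeId = in.readLong();
            name = Text.readString(in);
        }

        public void write(DataOutput out) throws IOException {
            out.writeLong(employeeId);
            Text.writeString(out, name);
        }
    }

    public static JobConf configureJob() {
        JobConf conf = new JobConf(EmployeeDbJob.class);

        // Driver class, JDBC URL and credentials are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/mydb", "user", "password");

        // Read the whole employees table, ordered on its primary key
        conf.setInputFormat(DBInputFormat.class);
        DBInputFormat.setInput(conf, EmployeeRecord.class,
                "employees", null /* conditions */, "employee_id" /* order by */,
                "employee_id", "name");

        // Write results back as INSERTs into the same two columns
        conf.setOutputFormat(DBOutputFormat.class);
        DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");

        // Mapper/reducer classes and output key/value types are omitted in this sketch
        return conf;
    }
}

The same EmployeeRecord class serves both directions: readFields(ResultSet) is used when reading with DBInputFormat, and write(PreparedStatement) is used when DBOutputFormat turns reducer output into INSERT statements.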
1. Run all the nodes in your production cluster as virtual machines on your development workstation.
2. Run the hadoop command with the -jt local and the -fs file:/// options.
3. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
4. Run simldooop, the Apache open-source software for simulating Hadoop clusters.
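For context, the -jt local and -fs file:/// options mentioned in choice 2 are handled by Hadoop's GenericOptionsParser and amount to setting two configuration properties. A minimal sketch using the old property names (the class name is made up for the example):

import org.apache.hadoop.mapred.JobConf;

public class LocalRunnerConfig {
    public static JobConf localConf() {
        JobConf conf = new JobConf();
        // Equivalent to -jt local: run the job in a single local JVM (LocalJobRunner)
        conf.set("mapred.job.tracker", "local");
        // Equivalent to -fs file:///: use the local filesystem instead of HDFS
        conf.set("fs.default.name", "file:///");
        return conf;
    }
}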
1. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
2. Both the keys and values passed to a reducer always appear in sorted order.
3. Neither keys nor values are in any predictable order.
4. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.
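These choices describe Hadoop's shuffle-and-sort guarantee: the keys reaching a reducer arrive in sorted order, but the values grouped under each key arrive in no predictable order (unless you implement a secondary sort). A minimal old-API reducer sketch, with word-count-style types assumed purely for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // reduce() is called with keys in sorted order within this reducer's partition,
        // but the IntWritable values under each key are in no guaranteed order.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}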