Question : Select the correct statement that applies to the DDL below:
CREATE TABLE page_view(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
1. The page_view table is bucketed (clustered by) userid, and within each bucket the data is sorted in increasing order of viewTime.
2. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid.
3. …
4. The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table - only how it is read.
Explanation: Bucketed Sorted Tables - In the example above, the page_view table is bucketed (clustered by) userid, and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. The MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table - only how it is read. This means that users must be careful to insert data correctly, by specifying the number of reducers to be equal to the number of buckets and by using CLUSTER BY and SORT BY commands in their query.
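A load that honors this layout might look like the following Java sketch, which issues the insert through Hive's JDBC driver. The HiveServer2 URL, the credentials, and the raw_page_view staging table are assumptions made purely for illustration; the point is setting the reducer count equal to the bucket count and using DISTRIBUTE BY/SORT BY so rows land and sort the way the DDL declares.

// Minimal sketch (assumed setup, not part of the original question): loading the
// bucketed, sorted page_view table over Hive JDBC from a staging table.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadPageView {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // One reducer per bucket, so 32 output files match the 32 declared buckets.
            stmt.execute("SET mapred.reduce.tasks=32");
            // DISTRIBUTE BY sends all rows for a userid to the same bucket;
            // SORT BY orders rows within each bucket by viewTime,
            // matching CLUSTERED BY(userid) SORTED BY(viewTime) in the DDL.
            stmt.execute(
                "INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US') "
                + "SELECT viewTime, userid, page_url, referrer_url, ip FROM raw_page_view "
                + "DISTRIBUTE BY userid SORT BY viewTime");
        }
    }
}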
Explanation: public static int frequency(Collection<?> c, Object o)
Returns the number of elements in the specified collection equal to the specified object. More formally, returns the number of elements e in the collection such that (o == null ? e == null : o.equals(e)).
Parameters:
c - the collection in which to determine the frequency of o
o - the object whose frequency is to be determined
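As a quick illustration of that contract (the list contents here are made up for the example):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FrequencyDemo {
    public static void main(String[] args) {
        // Collections.frequency counts elements equal to the given object,
        // handling null per the (o == null ? e == null : o.equals(e)) rule.
        List<String> pages = Arrays.asList("home", "search", "home", null, "home");
        System.out.println(Collections.frequency(pages, "home")); // prints 3
        System.out.println(Collections.frequency(pages, null));   // prints 1
    }
}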
Question : Select the correct statement about reading/writing data in an RDBMS using MapReduce.
1. In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with.
2. The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer.
3. …
4. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
5. All of the above
Correct Answer : …

Explanation: First we'll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. To start using DBInputFormat to connect to your database, you'll need to download the appropriate JDBC driver and drop it into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines, and on the machine from which you launch your jobs.

Reading Tables with DBInputFormat
The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop's formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Most queries are supported, subject to a few limitations discussed below.

Writing results back to the database
A companion class, DBOutputFormat, will allow you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before. The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its three arguments are the JobConf object for the job, a string defining the name of the table to write to, and an array of strings defining the fields of the table to populate, e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");. The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.

Limitations of the InputFormat
JDBC allows applications to generate SQL queries which are executed against the database; the results are then returned to the calling application. Keep in mind that you will be interacting with your database via repeated SQL queries. Therefore:
- Hadoop may need to execute the same query multiple times, and it will need to get the same results each time. So any concurrent updates to your database should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, by restricting your MapReduce job's query via a clause such as "insert_date < yesterday", or by dumping the data to a temporary table in the database before launching your MapReduce process.
- In order to parallelize the processing of records from the database, Hadoop will execute SQL queries that use ORDER BY, LIMIT, and OFFSET clauses to select ranges out of tables. Your results, therefore, need to be orderable by one or more keys (either PRIMARY, like the one in the example, or UNIQUE).
- In order to set the number of map tasks, the DBInputFormat needs to know how many records it will read.
So if you're writing an arbitrary SQL query against the database, you will need to provide a second query that returns the number of rows that the first query will return (e.g., by using COUNT and GROUP BY).

With these restrictions in mind, there's still a great deal of flexibility available to you. You can bulk load entire tables into HDFS, or select large ranges of data. For example, if you want to read records from a table that is also being populated by another source concurrently, you might set up that table to attach a timestamp field to each record. Before doing the bulk read, pick the current timestamp, then select all records with timestamps earlier than that one. New records being fed in by the other writer will have later timestamps and will not affect the MapReduce job.

Finally, be careful to understand the bottlenecks in your data processing pipeline. Launching a MapReduce job with 100 mappers performing queries against a database server may overload the server or its network connection. In this case, you'll achieve less parallelism than theoretically possible, due to starvation, disk seeks, and other performance penalties.

Limitations of the OutputFormat
The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer's close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
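The pieces described above can be summarized in a short sketch built on the classic org.apache.hadoop.mapred API. The employees table, its two columns, and the MySQL connection settings are illustrative assumptions, not part of the original question; the intent is only to show where DBWritable, DBConfiguration.configureDB(), DBInputFormat.setInput(), and DBOutputFormat.setOutput() fit together.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

/** One database row, deserialized into plain Java fields (illustrative sketch). */
public class EmployeeRecord implements Writable, DBWritable {
    private long employeeId;
    private String name;

    // DBWritable: called by DBInputFormat for every row of the ResultSet.
    @Override
    public void readFields(ResultSet rs) throws SQLException {
        employeeId = rs.getLong("employee_id");
        name = rs.getString("name");
    }

    // DBWritable: called by DBOutputFormat to fill the generated INSERT statement.
    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, employeeId);
        stmt.setString(2, name);
    }

    // Writable: needed so the record can travel between map and reduce tasks.
    @Override
    public void readFields(DataInput in) throws IOException {
        employeeId = in.readLong();
        name = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(employeeId);
        Text.writeString(out, name);
    }

    /** Job wiring: read the employees table, then write results back to it. */
    public static void configure(JobConf conf) {
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/mydatabase", "user", "password");

        conf.setInputFormat(DBInputFormat.class);
        // Full table scan, ordered by the primary key so the range-based splits
        // (ORDER BY / LIMIT / OFFSET) are deterministic.
        DBInputFormat.setInput(conf, EmployeeRecord.class,
                "employees", null /* conditions */, "employee_id",
                "employee_id", "name");

        conf.setOutputFormat(DBOutputFormat.class);
        DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");
    }
}

As the explanation notes, the write(PreparedStatement) side feeds the INSERT statements that each reducer executes in its close() call, so for very large exports a text-file dump followed by the database's bulk loader is usually the safer route.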
1. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
2. Pig comes with no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. …
4. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
1. The node containing the first TaskTracker to heartbeat into the JobTracker, regardless of the location of the input split
2. The node containing the first JobTracker to heartbeat into the NameNode, regardless of the location of the input split
3. …
4. The node containing the nearest location of the input split