
Cloudera Hadoop Developer Certification Questions and Answers (Dumps and Practice Questions)



Question : What describes the relationship between MapReduce and Hive?
1. Hive provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
2. Hive programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hive provides no additional capabilities to MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter.

Correct Answer : 4
Hive is a framework that translates queries written in Hive QL into jobs that are executed by the MapReduce framework. Hive does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform.






Question : What is HIVE?
1. HIVE is part of the Apache Hadoop project that enables in-memory analysis of real-time streams of data
2. Hive is a way to add data from local file system to HDFS
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hive is a part of the Apache Hadoop project that provides SQL like interface for data processing

Correct Answer : 4

Hive is a project initially developed by Facebook, specifically for people with very strong SQL skills but not very strong Java skills, who want to query data in Hadoop.




Question : Your cluster has slave nodes in three different racks, and you have written a rack topology script identifying each machine as being in hadooprack1,
hadooprack2, or hadooprack3. A client machine outside of the cluster writes a small (one-block) file to HDFS. The first replica of the block is written
to a node on hadooprack2. How is block placement determined for the other two replicas?
1. One will be written to another node on hadooprack2, and the other to a node on a different rack.

2. Either both will be written to nodes on hadooprack1, or both will be written to nodes on hadooprack3.

3. Access Mostly Uused Products by 50000+ Subscribers

4. One will be written to hadooprack1, and one will be written to hadooprack3.


Correct Answer : 2

Explanation: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

For the default threefold replication, Hadoop's rack placement policy is to write the first copy of a block on a node in one rack, then the other two copies on two nodes in a different rack. Since the first copy is written to hadooprack2, the other two will either be written to two nodes on hadooprack1, or two nodes on hadooprack3.

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

The current, default replica placement policy described here is a work in progress.



Related Questions


Question :
You are running a MapReduce job, and inside the Mapper you want to get the name of the file actually
being processed. Which code snippet correctly fetches the filename in the Mapper code?
(A hedged sketch of one common approach is shown after the answer choices.)

1. String fileName = ((FileStatus) context.getFileStatus()).getPath().getName();
2. String fileName = context.getPath().getName();
3. Access Mostly Uused Products by 50000+ Subscribers
4. All of the above
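
Note : A minimal sketch of one commonly used approach is shown below, assuming the new (org.apache.hadoop.mapreduce) API and an input format that produces FileSplits; the class and field names are illustrative and are not taken from the answer choices.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Cast the InputSplit to a FileSplit to reach the path of the file currently being processed.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        context.write(new Text(fileName), value);
    }
}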



Question : In a MapReduce word count job, you know your file contains at most three different words, and after
the job completes you want one output file to be created for each reducer. You have therefore written a custom
partitioner. Which code snippet is correct for this requirement? (An illustrative sketch of one possible
partitioner is shown after the answer choices.)
1. A
2. B
3. Access Mostly Uused Products by 50000+ Subscribers
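
Note : The answer-choice snippets (A and B) are not reproduced here. As an illustration only, a custom partitioner for a known three-word vocabulary might look like the sketch below; the word list and class name are assumptions, and the job must also be configured with three reducers (job.setNumReduceTasks(3)).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ThreeWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each of the three known words to its own reducer (numPartitions is expected to be 3).
        String word = key.toString();
        if ("hadoop".equals(word)) {
            return 0;
        } else if ("hive".equals(word)) {
            return 1;
        } else {
            return 2;
        }
    }
}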


Question : The input file size is given (in KB) and the block size is given (in MB). What is the size of the intermediate data produced?
1. 47KB
2. 83KB
3. Access Mostly Uused Products by 50000+ Subscribers
4. Job Fails


Question : You have two tables in Hive that are populated with data:
Employee
emp_id int
salary string

Employee_Detail
emp_id int
name string

You now create a new, de-normalized table and populate it with the results of joining the two tables as follows:
CREATE TABLE EMPLOYEE_FULL AS SELECT Employee_Detail.*,Employee.salary AS s
FROM Employee JOIN Employee_Detail ON (Employee.emp_id == Employee_Detail.emp_id);

You then export the table and download the file:
EXPORT TABLE EMPLOYEE_FULL TO '/hadoopexam/employee/Employee_Detail.data';

You have downloaded the file and read it as a CSV in R. How many columns will the resulting variable in R have?




1. 1
2. 2
3. Access Mostly Uused Products by 50000+ Subscribers
4. 4
5. 5

Ans : 1
Exp : When exporting a table from Hive, the data file uses the delimiters from the table.
Because EMPLOYEE_FULL wasn't created with specific delimiters, it uses the default Hive delimiter,
which is \001 (Control-A). When the file is imported into R as a CSV,
there will be only 1 column because the file isn't actually comma-delimited.



Question : You use Sqoop to import a table from your RDBMS into HDFS.
You know that Sqoop typically instantiates four Mappers. However, after the table import,
you notice that five Mappers have run, there are five output files in HDFS, and one of the output files is empty. Why?
1. The administrator has set the sqoop.num.maps property on the slave nodes to 7
2. Some Map tasks failed and had to be rerun
3. Access Mostly Uused Products by 50000+ Subscribers
4. The HDFS block size was set to a very small value, resulting in more Mappers than usual running
5. The table was modified by a user of the RDBMS as Sqoop was running

Ans : 3
Exp : If some Map task attempts failed, they would be rerun but no data from the failed task attempts would be stored on disk. There is no sqoop.num.maps property. Sqoop typically reads the table in a single transaction, so modifying the data would have no effect; and the HDFS block size is irrelevant to the number of files created. The correct answer is that by default, Sqoop uses the table's primary key to determine how to split the data. If there is no numeric primary key, Sqoop will make a best-guess attempt at how the data is distributed, and may run more than its default four Mappers, although some may end up not actually reading any data.


Question : Using Apache Sqoop, you can import data into which of the following?


1. Apache Hive
2. Apache HBase
3. Access Mostly Uused Products by 50000+ Subscribers
4. Only 1 and 3
5. All 1,2 and 3

Ans : 5
Exp : Apache Sqoop can be used to import data from any relational database into HDFS, Hive or HBase.
To import data into HDFS, use the sqoop import command and specify the relational DB table and connection parameters:

sqoop import --connect "JDBC connection string" --table "tablename" --username "username" --password "password"
This will import the data and store it as a CSV file in a directory in HDFS. To import data into Hive, use the sqoop import command and specify the '--hive-import' option.

sqoop import --connect "JDBC connection string" --table "tablename" --username "username" --password "password" --hive-import

This will import the data into a Hive table with the appropriate data types for each column.


Question : You decide to use Hive to process data in HDFS. You have not created any Hive tables until now.
Hive is configured with its default settings. You run the following commands from the Hive shell:

CREATE DATABASE HADOOPEXAM;
USE HADOOPEXAM;
CREATE TABLE HE_EMPLOYEE(EMP_NAME STRING, EMP_ID INT, EMP_SALARY INT);


In which directory should you now store data so that it can be accessed as table HE_EMPLOYEE in database HADOOPEXAM?

1. /user/hive/warehouse/HADOOPEXAM/HE_EMPLOYEE
2. /user/hive/warehouse/HADOOPEXAM.db/HE_EMPLOYEE
3. Access Mostly Uused Products by 50000+ Subscribers
4. /user/hive/HADOOPEXAM.db/HE_EMPLOYEE
5. /user/hive/warehouse/HADOOPEXAM HE_EMPLOYEE

Ans : 2
Exp : When you create a database named HADOOPEXAM in Hive, that creates a subdirectory of Hive's warehouse directory named HADOOPEXAM.db.
All tables are placed in subdirectories of HADOOPEXAM.db; those subdirectory names are the names of the tables.



Question : For HadoopExam.com user profiles, you need to analyze roughly ,, JPEG files. Each file is no more than 3 KB.
Because your Hadoop cluster isn't optimized for storing and processing many small files,
you decide to group the files into a single archive. The toolkit that will be used to process
the files is written in Ruby and requires that it be run with administrator privileges.
Which of the following file formats should you select to build your archive?

1. TIFF
2. SequenceFiles
3. Access Mostly Uused Products by 50000+ Subscribers
4. MPEG
5. Avro

Ans : 5

Exp : The two formats that are best suited to merging small files into larger archives for processing in Hadoop are Avro and SequenceFiles. Avro has Ruby bindings; SequenceFiles are only supported in Java.

JSON, TIFF, and MPEG are not appropriate formats for archives. JSON is also not an appropriate format for image data.
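
Note : For comparison, a hedged sketch of packing small files into a SequenceFile from Java (file name as the key, raw bytes as the value) is shown below; the class name and paths are illustrative. This route would not satisfy the Ruby requirement in the question, which is why Avro is the answer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);   // directory holding the small JPEG files
        Path output = new Path(args[1]);  // target SequenceFile

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(input)) {
                byte[] bytes = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(bytes);
                }
                // Key: the original file name; value: the file's raw contents.
                writer.append(new Text(status.getPath().getName()), new BytesWritable(bytes));
            }
        }
    }
}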




Question : SequenceFiles are flat files consisting of binary key/value pairs. SequenceFile provides Writer, Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.
There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs.
You have created a SequenceFile (MAIN.PROFILE.log) with custom key and value types. What command displays the contents of a
SequenceFile named MAIN.PROFILE.log in your terminal in human-readable format?

1. hadoop fs -decrypt MAIN.PROFILE.log
2. hadoop fs -text MAIN.PROFILE.log
3. Access Mostly Uused Products by 50000+ Subscribers
4. hadoop fs -encode MAIN.PROFILE.log
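
Note : hadoop fs -text is the command-line route for viewing SequenceFiles. As additional context only, a minimal programmatic sketch using the SequenceFile.Reader class mentioned in the question might look like the following; the key and value classes are read from the file's header, so the custom types must be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDumper {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("MAIN.PROFILE.log");
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            // Instantiate key/value objects of whatever Writable types the file header declares.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}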




Question : Speculative execution is an optimization technique where a computer system performs
some task that may not actually be needed. The main idea is to do work before it is known whether that work will be needed at all,
so as to prevent a delay that would otherwise be incurred by doing the work only after it is known to be needed. If it turns out the work was not needed
after all, any changes made by the work are reverted and the results are ignored. In an ETL MapReduce job, Mappers process the data
and then, using DBOutputFormat in the Reducers, the results are pushed directly to an Oracle database. Select the correct statement
regarding speculative execution for this job.

1. Disable speculative execution for the data insert job
2. Enable speculative execution for the data insert job
3. Access Mostly Uused Products by 50000+ Subscribers
4. Configure only single mapper for the data insert job
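
Note : Reduce-side speculative execution is commonly disabled when Reducers write directly to an external database, because a speculatively launched duplicate task attempt could insert the same rows twice. A minimal sketch of how that is typically configured in the driver (Hadoop 2.x property name) follows; the class and job names are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EtlDriverSketch {
    public static Job createJob() throws IOException {
        Configuration conf = new Configuration();
        // Turn off speculative execution for the reduce phase that pushes rows to the database,
        // so a duplicate speculative attempt cannot insert the same rows twice.
        conf.setBoolean("mapreduce.reduce.speculative", false);
        // Older Hadoop releases use the property name mapred.reduce.tasks.speculative.execution.
        return Job.getInstance(conf, "etl-db-insert");
    }
}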




Question : Apache MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs.
The MRUnit testing framework is based on JUnit and can test MapReduce programs written for the 0.20, 0.23.x, 1.0.x and 2.x versions of Hadoop.
You have a Reducer which simply sums up the values for any given key. You write a unit test in MRUnit to test the Reducer, with this code:
@Test
public void testETLReducer() {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    List<IntWritable> values2 = new ArrayList<IntWritable>();
    values2.add(new IntWritable(1));
    values2.add(new IntWritable(1));
    reduceDriver.withInput(new LongWritable("5673"), values);
    reduceDriver.withInput(new LongWritable("109098"), values2);
    reduceDriver.withOutput(new LongWritable("109098"), new IntWritable(2));
    reduceDriver.runTest();
}

What is the result?


1. The test will pass with warning and error
2. The test will pass with no warning and error
3. Access Mostly Uused Products by 50000+ Subscribers
4. Code will not compile
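
Note : The test above references a reduceDriver that is not shown being constructed. For context only, a minimal MRUnit setup for such a sum Reducer typically looks like the sketch below; SumReducer and the field name are assumptions and are not part of the original question.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;

public class ETLReducerTest {

    // Assumed Reducer under test: sums the IntWritable values for each key.
    public static class SumReducer
            extends Reducer<LongWritable, IntWritable, LongWritable, IntWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    private ReduceDriver<LongWritable, IntWritable, LongWritable, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        reduceDriver = ReduceDriver.newReduceDriver(new SumReducer());
    }
}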