Question : The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Select the correct statement regarding Hive from the options below.
1. Hive comes with no additional capabilities to MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter as well as some logic in memory.
2. Hive comes with additional capabilities to MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hive comes with no additional capabilities to MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter.
Explanation: Hive is a framework that translates queries written in HiveQL into jobs that are executed by the MapReduce framework. Hive does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform. For an overview of Hive, see chapter 12 of Hadoop: The Definitive Guide, 3rd Edition, or Hive - A Warehousing Solution Over a Map-Reduce Framework.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL[6] with schema on read and transparently converts queries to MapReduce, Apache Tez and (in the future) Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes. By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC and RCFILE. Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13. Other features of Hive include:
- Indexing to provide acceleration, with index types including compaction and bitmap index as of 0.10; more index types are planned.
- Different storage types such as plain text, RCFile, HBase, ORC, and others.
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
- Operating on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, snappy, etc.
- Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce jobs.
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but only offers basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited subquery support.[18][19] Support for insert, update, and delete with full ACID functionality was made available with release 0.14.[20]
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.[21]
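Since the explanation stresses that HiveQL queries are translated into MapReduce jobs rather than written as mapper/reducer code by hand, here is a minimal client-side sketch that submits a HiveQL query through the Hive JDBC driver. It assumes a HiveServer2 instance at localhost:10000, a table named products, and the hive-jdbc driver on the classpath; these names are illustrative and are not taken from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL: jdbc:hive2://<host>:<port>/<database>
        // (requires org.apache.hive.jdbc.HiveDriver on the classpath)
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive plans and executes this query as one or more MapReduce
            // (or Tez) jobs on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}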
Watch modules 12 and 13 from http://hadoopexam.com/index.html/#hadoop-training
Question : You've written a MapReduce job based on the HadoopExam website's log file named MAIN.PROFILE.log, resulting in an extremely large amount of output data. Which of the following cluster resources will your job stress?
1. network I/O and disk I/O
2. network I/O and RAM
3. Access Mostly Uused Products by 50000+ Subscribers
4. RAM, network I/O and disk I/O
Explanation: MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all our nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in our cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them. Therefore, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.
When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.
When denormalizing a table, the amount of data written by the map phase will far exceed the amount of data read by the map phase. All of the data written during the map phase is first written to local disk and then transferred over the network to the reducers during the beginning of the reduce phase. Writing a very large amount of data in the map phase will therefore create a large amount of local disk I/O on the machines running map tasks, as well as network I/O. Because map output is stored in a fixed-size buffer that is written periodically to disk, this operation will not tax the memory of the machines running the map tasks. Denormalizing a table is not a compute-intensive operation, so this operation will not tax the processors of the machines running the map tasks.
Partition & Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data. The Partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result (see the sketch below).
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
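To make the "map nodes must all agree on where to send each key" point concrete, here is a minimal hash-based partitioner sketch in the style of Hadoop's default HashPartitioner. It uses the old org.apache.hadoop.mapred API that the other snippets in this section use; treat the Text key/value types as an assumption for illustration.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Every mapper computes the same partition number for a given key, so all
// values for that key are sent to the same reducer.
public class SimpleHashPartitioner implements Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take
        // the remainder modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    @Override
    public void configure(JobConf conf) {
        // No configuration needed for this sketch.
    }
}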
Reduce: A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives as parameters OutputCollector and Reporter objects; they are used in the same manner as in the map() method.
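As a concrete illustration of that reduce() signature, here is a minimal Reducer sketch in the old org.apache.hadoop.mapred API that simply concatenates the values seen for each key. The Text key/value types are an assumption chosen to match the collect calls in the next question; this is an illustrative sketch, not a required implementation.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// reduce() is called once per key; the iterator supplies all values that
// the shuffle grouped under that key, in no guaranteed order.
public class ConcatReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
            if (joined.length() > 0) {
                joined.append(", ");
            }
            joined.append(values.next().toString());
        }
        // Emit one (key, "v1, v2, ...") pair per distinct key.
        output.collect(key, new Text(joined.toString()));
    }
}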
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
Question : You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Flag"), new Text("Rahul")); output.collect(new Text("Shirt"), new Text("Yakul")); output.collect(new Text("Shoe"), new Text("Rahul")); output.collect(new Text("Flag"), new Text("Gemini")); output.collect(new Text("Socks"), new Text("Yakul"));
How many times will the Reducer's reduce() method be invoked?
Explanation: Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number; it would look like:
AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm
...
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow. MapReduce is also less strict than other languages about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs.
After the shuffle and sort phase there will be 4 calls to the reduce method: one passed "Flag" as the key and "Rahul, Gemini" as the values, one passed "Shirt" as the key and "Yakul" as the value, one passed "Shoe" as the key and "Rahul" as the value, and one passed "Socks" as the key and "Yakul" as the value.
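To see why there are four invocations, the following plain-Java sketch (a simulation of the grouping step, not Hadoop code) collects the five emitted pairs by key, the way the shuffle and sort phase would before handing each group to reduce():

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates shuffle-and-sort for the five collect() calls above: values are
// grouped by key and sorted by key; reduce() would run once per group.
public class ShuffleGroupingDemo {
    public static void main(String[] args) {
        String[][] emitted = {
            {"Flag", "Rahul"}, {"Shirt", "Yakul"}, {"Shoe", "Rahul"},
            {"Flag", "Gemini"}, {"Socks", "Yakul"}
        };
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : emitted) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // Prints 4 groups: Flag=[Rahul, Gemini], Shirt=[Yakul],
        // Shoe=[Rahul], Socks=[Yakul] -- one reduce() call per group.
        grouped.forEach((key, values) ->
            System.out.println(key + " -> " + values));
        System.out.println("reduce() invocations: " + grouped.size());
    }
}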
Watch the training from http://hadoopexam.com/index.html/#hadoop-training
1. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
2. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
3. Access Mostly Uused Products by 50000+ Subscribers
4. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
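For context on what these SequenceFile options describe, here is a minimal sketch that writes a SequenceFile whose keys are all Text and whose values are all IntWritable. The path and the key/value classes are arbitrary choices for illustration, and the FileSystem-based createWriter overload shown here is the classic one (it is deprecated in newer Hadoop releases in favor of the option-based overload).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes a SequenceFile of (Text, IntWritable) records: every key shares
// one type and every value shares one type, declared when the writer is
// created.
public class SequenceFileWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq"); // illustrative path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("Flag"), new IntWritable(2));
            writer.append(new Text("Shirt"), new IntWritable(1));
        } finally {
            writer.close();
        }
    }
}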
1. The number of values across different keys in the iterator supplied to a single reduce method call.
2. The amount of intermediate data that must be transferred between the mapper and reducer.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The number of output files a reducer must produce.