Question : Which best describes when the reduce method is first called in a MapReduce job?
1. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins. 2. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted. 3. Access Mostly Uused Products by 50000+ Subscribers optimal performance for map-only or reduce-only jobs. 4. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
To chain multiple MapReduce jobs, the driver for each job must create a new JobConf object and set its input path to the output path of the previous job.
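As a rough illustration (a minimal sketch using the old mapred API; the class name ChainedJobsDriver and the /data/... paths are hypothetical, and the per-step mapper/reducer setup is omitted), a chained driver might look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    // First job: reads the raw input and writes to an intermediate path.
    JobConf job1 = new JobConf(ChainedJobsDriver.class);
    job1.setJobName("step-1");
    // job1.setMapperClass(...); job1.setReducerClass(...);  // per-step classes omitted
    FileInputFormat.setInputPaths(job1, new Path("/data/input"));         // hypothetical path
    FileOutputFormat.setOutputPath(job1, new Path("/data/intermediate")); // hypothetical path
    JobClient.runJob(job1);  // blocks until the first job completes

    // Second job: a new JobConf whose input path is the first job's output path.
    JobConf job2 = new JobConf(ChainedJobsDriver.class);
    job2.setJobName("step-2");
    // job2.setMapperClass(...); job2.setReducerClass(...);  // per-step classes omitted
    FileInputFormat.setInputPaths(job2, new Path("/data/intermediate"));
    FileOutputFormat.setOutputPath(job2, new Path("/data/final"));
    JobClient.runJob(job2);
  }
}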
1. It is the newer version of MapReduce; using it, the performance of data processing can be increased. 2. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. 3. Access Mostly Uused Products by 50000+ Subscribers 4. All of the above 5. Only 2 and 3 are correct Ans : 5 Exp : MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs.
Question : Which of the following is a component of the ResourceManager?
Exp : The ResourceManager has two main components: the Scheduler and the ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status.
Question :
The Scheduler of the ResourceManager guarantees restarting of failed tasks, whether due to application failure or hardware failure. 1. True 2. False Ans : 2 Exp : The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status. Also, it offers no guarantees about restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, network, etc.
Question :
Which statement is true about the ApplicationsManager?
1. It is responsible for accepting job submissions 2. It negotiates the first container for executing the application-specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure 3. Access Mostly Uused Products by 50000+ Subscribers 4. All of the above 5. 1 and 2 are correct Ans : 5 Exp : The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
Question :
Which Java class represents a directory or file and provides its metadata, e.g. block size, replication factor, ownership, and permissions?
1. True 2. False Ans : 2 Exp : The NameNode does not store block locations persistently, since this information is reconstructed from the DataNodes when the system starts.
Question :
Which tool is used to list all the blocks of a file ?
Question : HDFS cannot store a file whose size is greater than the disk size of a single node:
1. True 2. False Ans : 2 Exp : It can, because the file is divided into blocks and the blocks can be stored on any nodes in the cluster.
Question : When is the earliest point at which the reduce method of a given Reducer can be called?
1. As soon as at least one mapper has finished processing its input split. 2. As soon as a mapper has emitted at least one record. 3. Not until all mappers have finished processing all records. 4. It depends on the InputFormat used for the job. Ans : 3 Exp : In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog reducers while they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.
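As a sketch of setting the slow-start threshold per job from the driver (assuming the Hadoop 2.x Job API; the job name is illustrative, and the property name is the one used in the explanation above, which newer releases spell mapreduce.job.reduce.slowstart.completedmaps):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Delay the reduce phase until 90% of the map tasks have completed.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);
    Job job = Job.getInstance(conf, "slow-start-example");
    // ... set mapper, reducer, input and output paths here ...
  }
}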
Question : Which describes how a client reads a file from HDFS?
1. The client queries the NameNode for the block locations. The NameNode returns the block locations to the client. The client reads the data directly off the DataNodes. 2. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode. 3. The client contacts the NameNode for the block locations. The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data blocks. The client then reads the data directly off the DataNode. 4. The client contacts the NameNode for the block locations. The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client. Ans : 1 Exp : Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
Question : You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
Question : How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
1. Keys are presented to a reducer in sorted order; values for a given key are not sorted. 2. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order. 3. Keys are presented to a reducer in random order; values for a given key are not sorted. 4. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order. Ans : 1 Explanation: A Reducer has 3 primary phases: 1. Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network. 2. Sort: The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged. SecondarySort: To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. 3. Reduce: In this phase the reduce(Object, Iterable, Context) method is called for each {key, (collection of values)} in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted. Reference: org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Question : You wrote a map function that throws a runtime exception when it encounters a control character in the input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4.
1. You will have forty-eight failed task attempts 2. You will have seventeen failed task attempts 3. Access Mostly Uused Products by 50000+ Subscribers 4. You will have twelve failed task attempts 5. You will have twenty failed task attempts Ans : 5 Exp : There will be four failed task attempts for each of the five file splits, since each map task fails on the first control character it encounters and is retried up to four times (5 x 4 = 20).
Question : You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
Question : You are running a job that will process a single InputSplit on a cluster which has no other jobs currently running. Each node has an equal number of open Map slots. On which node will Hadoop first attempt to run the Map task?
1. The node with the most memory 2. The node with the lowest system load 3. The node on which the InputSplit's data block is stored 4. The node with the most free local disk space Ans : 3 Exp : Hadoop attempts data-local processing: whenever possible, the JobTracker starts the task on the node where the data block is stored.
Question : Assuming default settings, which best describes the order of data provided to a reducer’s reduce method
1. The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are. 2. Both the keys and values passed to a reducer always appear in sorted order. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order Ans : 4 Exp :
Question : A MapReduce job processes millions of input records and generates the same number of key-value pairs (in the millions). The data is not uniformly distributed, so the MapReduce job is going to create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
1. Partitioner 2. OutputFormat 3. Access Mostly Uused Products by 50000+ Subscribers 4. Combiner Ans : 4 Exp : Using a combiner (which is often the same code as the reducer), we can reduce the amount of data transferred over the network during the shuffle step.
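For example, a combiner is registered from the driver roughly as in the fragment below (assuming the Hadoop 2.x Job API; TokenizerMapper and SumReducer are hypothetical classes defined elsewhere):

Job job = Job.getInstance(new Configuration(), "word-count");
job.setMapperClass(TokenizerMapper.class);   // hypothetical mapper
job.setCombinerClass(SumReducer.class);      // combiner: typically the same class as the reducer
job.setReducerClass(SumReducer.class);       // hypothetical reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

The combiner runs on the map side, so partial sums are computed locally and far fewer key-value pairs cross the network to the reducers.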
Question : Is joining two large datasets in MapReduce possible, like RDBMS tables sharing a key? Assume that the two datasets are CSV files stored in HDFS.
1. Yes 2. Yes, but only if one of the tables fits into memory 3. Access Mostly Uused Products by 50000+ Subscribers 4. No, MapReduce cannot perform relational operations. 5. No, but it can be done with either Pig or Hive.
Ans : 1 Exp : Yes, it is possible to join large datasets in Hadoop MapReduce; there are various approaches available for joining data in MapReduce, e.g. map-side joins, reduce-side joins, and joins using the DistributedCache.
Question : You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
1. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk 2. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS. 3. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Mapper. 4. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer 5. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Ans : 3 Exp : The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Question : You want to understand more about how users browse your public website named MapRPublic.com, such as which pages they visit prior to placing an order. You have a farm of 500 Tomcat and JBoss servers on the cloud (Microsoft Azure) hosting your website. How will you gather this data for your analysis?
1. Ingest the server web logs into HDFS using Flume. 2. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Channel these clickstreams into Hadoop using Hadoop Streaming. 5. Sample the weblogs from the web servers, copying them into Hadoop using curl.
Ans : 1 Exp : Flume is the tool used to ingest web server logs into HDFS. The parsing example below shows how the logs can be parsed once the data has been ingested into HDFS using Flume.
Hadoop MapReduce for Parsing Weblogs. Here are the steps for parsing a log file using Hadoop MapReduce: Load log files into HDFS using this Hadoop command: hadoop fs -put (local file path of weblogs) (HDFS destination path) The opencsv 2.3 library (Opencsv2.3.jar) is used for parsing log records. Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper extends Mapper<Object, Text, NullWritable, Text> {
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the log line on spaces, honoring quoted fields.
        CSVParser parse = new CSVParser(' ', '\"');
        String sp[] = parse.parseLine(value.toString());
        int spSize = sp.length;
        // Re-join the parsed fields as a comma-separated record.
        StringBuffer rec = new StringBuffer();
        for (int i = 0; i < spSize; i++) {
            rec.append(sp[i]);
            if (i != (spSize - 1)) rec.append(",");
        }
        word.set(rec.toString());
        context.write(NullWritable.get(), word);
    }
}
The command below runs the Hadoop-based log parser. The MapReduce program is attached in this article. You can add extra parsing methods in the class; be sure to create a new JAR with any change and move it to the Hadoop JobTracker system. hadoop jar (path of logparse jar) (HDFS logfile path) (output path of parsed log file) The output file is stored in the HDFS location, and the output file name starts with "part-".
Question : You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.
Ans : 3 Exp : Configure the property using the -D key=value notation: -D mapred.job.name='My Job'. You can list the available options by calling the streaming jar with just the -info argument. (Reference: Python Hadoop Streaming - setting a job name.)
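A minimal sketch of such a driver (the class name ExampleDriver is hypothetical and the job setup is omitted): ToolRunner parses generic options such as -D mapred.job.name=Example before run() is called, so they are already present in the Configuration returned by getConf().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value options parsed by ToolRunner.
    Job job = Job.getInstance(getConf());
    // ... set mapper, reducer, input and output paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // Typical invocation: hadoop jar example.jar ExampleDriver -D mapred.job.name=Example <in> <out>
    System.exit(ToolRunner.run(new Configuration(), new ExampleDriver(), args));
  }
}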
Question : You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.
1. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods 2. The data types specified in HADOOP_MAP_DATATYPES environment variable 3. Access Mostly Uused Products by 50000+ Subscribers 4. The InputFormat used by the job determines the mapper's input key and value types.
Ans : 4 Exp : The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
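As an illustration, a fragment of a driver declaring the types when the mapper and reducer outputs differ might look like the sketch below (this assumes job is an org.apache.hadoop.mapreduce.Job; the new-API Job class exposes methods with the same names as the JobConf methods mentioned above, and the concrete types are just an example):

// Mapper emits (Text, IntWritable); reducer emits (Text, LongWritable).
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);            // final (reducer) output key type
job.setOutputValueClass(LongWritable.class);  // final (reducer) output value type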
1. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous). 2. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair. 3. Access Mostly Uused Products by 50000+ Subscribers 4. One intermediate key-value pair, but of the same type. 5. As many intermediate key-value pairs as designed, as long as all the keys have the same type and all the values have the same type.
Ans : 5 Exp : Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
Question : You have the following key-value pairs as output from your Map task: (the, 1) (fox, 1) (faster, 1) (than, 1) (the, 1) (dog, 1) How many keys will be passed to the Reducer's reduce method?
Question : You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
Ans : 3 Exp : Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs. We use Pig scripts for sifting through the data and extracting useful information from the Web logs. We load the log file into Pig using the LOAD command. raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray); Note 1: Data Flow and Components *Content will be created by multiple Web servers and logged on local hard discs. This content will then be pushed to HDFS using the FLUME framework. FLUME has agents running on the Web servers; these are machines that collect data intermediately using collectors and finally push that data to HDFS. *Pig scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts actually analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementations for other repositories as well, such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig scripts can either push this data to HDFS, after which MR jobs will be required to read and push this data into HBase, or Pig scripts can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework's applicability for log analysis at large scale. *The HBase database will have the data processed by Pig scripts ready for reporting and further slicing and dicing. *The data-access Web service is a REST-based service that eases access and integration with data clients. The client can be written in any language that can access a REST-based API. These clients could be BI- or UI-based clients. Note 2: The Log Analysis Software Stack *Hadoop is an open source framework that allows users to process very large data in parallel. It's based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules: 1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster. 2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bundled with HDFS. *The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can store very large tables having millions of rows. It's a distributed database and can also keep multiple versions of a single row. *The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language. *Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing.
It's a data flow solution, where there is an originator and a destination for each node, and it is divided into Agent and Collector tiers for collecting logs and pushing them to the destination storage.
Question : What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
1. You will not be able to compress the intermediate data. 2. You will no longer be able to take advantage of a Combiner. 3. The output files will not be globally sorted, so you cannot simply concatenate them to produce a single sorted output file. 4. There are no concerns with this approach. It is always advisable to use multiple reducers.
Ans : 3 Exp : If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file; to do that you would need a total ordering of the keys across all output files.
Question : Given a directory of files with the following structure: line number, tab character, string. Example: 1<tab>filedata1, 2<tab>filedata2, 3<tab>filedata3. You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
Ans : 3 Exp : KeyValueTextInputFormat. TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character; for example, this is the output produced by TextOutputFormat, Hadoop's default output format. To interpret such files correctly, KeyValueTextInputFormat is appropriate. You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or key.value.separator.in.input.line in the old API); it is a tab character by default. Consider the following input file, where the space represents a horizontal tab character: line1 On the top of the Crumpetty Tree line2 The Quangle Wangle sat, line3 But his face you could not see, line4 On account of his Beaver Hat. As in the TextInputFormat case, the input is a single split comprising four records, although this time the keys are the Text sequences before the tab in each line: (line1, On the top of the Crumpetty Tree) (line2, The Quangle Wangle sat,) (line3, But his face you could not see,) (line4, On account of his Beaver Hat.) SequenceFileInputFormat: To use data from sequence files as the input to MapReduce, you use SequenceFileInputFormat. The keys and values are determined by the sequence file, and you need to make sure that your map input types correspond to them.
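A small driver fragment using the old mapred API referenced in the question (the driver class MyDriver is hypothetical; tab is already the default separator and is shown only for clarity):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

JobConf conf = new JobConf(MyDriver.class);          // MyDriver is hypothetical
conf.set("key.value.separator.in.input.line", "\t"); // old-API property name
conf.setInputFormat(KeyValueTextInputFormat.class);
// The mapper now receives Text keys (text before the tab) and Text values (text after it).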
Question : For each intermediate key, each reducer task can emit:
1. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous). 2. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs. 3. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type. 4. One final key-value pair per value associated with the key; no restrictions on the type. 5. One final key-value pair per key; no restrictions on the type.
Ans : 3 Exp :Reducer reduces a set of intermediate values which share a key to a smaller set of values. Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Question : What data does a Reducer reduce method process?
Ans : 3 Exp :Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. All values with the same key are presented to a single reduce task.
Question : All keys used for intermediate output from mappers must:
Ans : 3 Exp :The MapReduce framework operates exclusively on (key, value) pairs, that is, the framework views the input to the job as a set of (key, value) pairs and produces a set of (key, value) pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Question : On a cluster running MapReduce v (MRv), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?
1. The amount of RAM installed on the TaskTracker node. 2. The amount of free disk space on the TaskTracker node. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The average system load on the TaskTracker node over the past fifteen (15) minutes. 5. The location of the InputSplit to be processed in relation to the location of the node.
Ans : 5 Exp : The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
Question : A client application creates an HDFS file named foo.txt with a replication factor of three. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C.
1. The file will be marked as corrupted if data node B fails during the creation of the file. 2. Each data node locks the local file to prohibit concurrent readers and writers of the file. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Each data node stores a copy of the file in the local file system with the same name as the HDFS file. 5. The file can be accessed if at least one of the data nodes storing the file is available.
Ans : 5 Exp : HDFS keeps three copies of a block on three different DataNodes to protect against true data corruption. HDFS also tries to distribute these three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed DataNode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes are sufficient to avoid corrupted files. Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on DataNodes on the same rack and the 3rd copy on a different rack.
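As a sketch of setting the replication factor programmatically from a client (the HDFS path is hypothetical; FileSystem.create has an overload that accepts a replication factor, and setReplication changes it later):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/foo.txt");   // hypothetical path

    // Create the file with an explicit replication factor of 3.
    FSDataOutputStream out = fs.create(file, (short) 3);
    out.writeUTF("example content");
    out.close();

    // The replication factor can also be changed after creation.
    fs.setReplication(file, (short) 2);
  }
}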
Question : Which process describes the lifecycle of a Mapper?
1. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method. 2. The TaskTracker spawns a new Mapper to process all records in a single input split. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The JobTracker spawns a new Mapper to process all records in a single file.
Ans : 3 Exp : For each map instance that runs, the TaskTracker creates a new instance of your mapper. Note: *The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of extraction and transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different, Key/Value type. *With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out. Examining the run() method, we can see the lifecycle of the mapper:
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method. map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value). cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Question : To process input key-value pairs, your mapper needs to load an MB-sized data file into memory. What is the best way to accomplish this?
1. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper. 2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
Ans : 4 Exp : Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally. Use case: let's understand our use case in a bit more detail so that we can follow the code snippet. We have a key-value file that we need to use in our map jobs. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value. So what we need is: a key-value file (let's use a Properties file) and the Mapper code that uses it.
public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Properties cache;

    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(value)
        // do some action or replace with something else
    }
}
Note: * Distribute application-specific large, read-only files efficiently. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.
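On the driver side, the file is registered with the cache roughly as in the fragment below (old mapred API; MyDriver and the HDFS path are hypothetical):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);                       // MyDriver is hypothetical
// The file must already exist on HDFS at this (hypothetical) path.
DistributedCache.addCacheFile(new URI("/user/demo/lookup.properties"), conf);
// Each TaskTracker copies the file locally before the mapper's setup()/configure()
// method runs, where it can be read into memory as shown above.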
Question : In a MapReduce job, the reducer receives all values associated with same key. Which statement best describes the ordering of these values?
1. The values are in sorted order. 2. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values. Ans : 2 Exp :*Input to the Reducer is the sorted output of the mappers. *The framework calls the application's Reduce function once for each unique key in the sorted order. *Example: For the given sample input the first map emits: (Hello, 1) ( World, 1) ( Bye, 1) ( World, 1) The second map emits: ( Hello, 1) ( Hadoop, 1) ( Goodbye, 1) ( Hadoop, 1)
Question : You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
Question : You want to count the number of occurrences of each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing reducer as your combiner in this case, and why or why not?
1. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match. 2. No, because the sum operation in the reducer is incompatible with the operation of a Combiner. 3. Access Mostly Uused Products by 50000+ Subscribers 4. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value. 5. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.
Ans : 1 Exp : Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
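For example, a word-count sum reducer like the sketch below can safely double as a combiner, because summation is commutative and associative and its input and output types match (this is an illustrative class, not taken from any particular question above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Can be registered with both job.setReducerClass(...) and job.setCombinerClass(...).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();   // partial sums from the combiner add up to the same total
    }
    result.set(sum);
    context.write(key, result);
  }
}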
Question : Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
Ans : 4 Exp : The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop wiki): Client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data. The JobTracker locates TaskTracker nodes with available slots at or near the data. The JobTracker submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
Question : Which project gives you a distributed, Scalable, data store that allows you random, real-time read/write access to hundreds of terabytes of data?
Ans : 1 Exp : Use Apache HBase when you need random, realtime read/write access to your Big Data.
HBase's goal is the hosting of very large tables - billions of rows x millions of columns - atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features of HBase: - Linear and modular scalability. - Strictly consistent reads and writes. - Automatic and configurable sharding of tables. - Automatic failover support between RegionServers. - Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables. - Easy to use Java API for client access. - Block cache and Bloom filters for real-time queries. - Query predicate push down via server-side filters. - Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options. - Extensible JRuby-based (JIRB) shell. - Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
Question : Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
Exp :Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: Imports individual tables or entire databases to files in HDFS Generates Java classes to allow you to interact with your imported data Provides the ability to import from SQL databases straight into your Hive data warehouse
Data Movement Between Hadoop and Relational Databases Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function. Note:
* Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Question : You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it is given a path object representing this directory?
1. Four, all files will be processed 2. Three, the pound sign is an invalid character for HDFS file names 3. Two, file names that begin with an underscore or a period are ignored 4. None, the directory cannot be named jobdata 5. One, no special characters can prefix the name of an input file
Ans : 3 Exp : Files starting with '_' are considered 'hidden', like Unix files starting with '.'. The '#' character is allowed in HDFS file names.
Question : You write a MapReduce job to process files in HDFS. Your MapReduce algorithm uses TextInputFormat; the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
1. There is no difference in output between the two settings. 2. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS. 3. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS. 4. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Ans : 4 Exp : It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem. *Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks; rather, the outputs of the mapper tasks will be the final output of the job.
Reduce: In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each (key, (list of values)) pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
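A minimal driver fragment showing the map-only configuration under these assumptions (old JobConf API to match the text above; MyMapOnlyDriver is a hypothetical driver class):

JobConf conf = new JobConf(MyMapOnlyDriver.class);  // hypothetical driver class
conf.setNumReduceTasks(0);   // zero reducers: map output is written directly to HDFS, unsorted
// conf.setNumReduceTasks(1); // one reducer instead: all matches end up in a single sorted output file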
Question : In a MapReduce job with map tasks, how many map task attempts will there be?
A task attempt is a particular instance of an attempt to execute a task.
There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution can also result in more task attempts than completed tasks.
Question : What types of algorithms are difficult to express in MapReduce v (MRv)?
1. Algorithms that require applying the same mathematical function to large numbers of individual binary records. 2. Relational operations on large amounts of structured and semi-structured data. 3. Algorithms that require global, shared state. 4. Large-scale graph algorithms that require one-step link traversal. 5. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Ans : 3 Exp :
Limitations of MapReduce: where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not well suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values. If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the summation of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map-reduce process.
2. Full-text indexing or ad hoc searching. The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms that depend on shared global state. Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).
Question : In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
1. It returns a reference to a different Writable object each time. 2. It returns a reference to a Writable object from an object pool. 3. It returns a reference to the same Writable object each time, but populated with different data. 4. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object. 5. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Ans : 3 Exp : Calling Iterator.next() will always return the same exact instance of IntWritable, with the contents of that instance replaced with the next value. (Reference: manipulating the iterator in MapReduce.)
Question : Analyze each scenario below and identify which best describes the behavior of the default partitioner?
1. The default partitioner assigns key-value pairs to reducers based on an internal random number generator. 2. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space. 3. Access Mostly Uused Products by 50000+ Subscribers with different buckets, and each bucket is assigned to a specific reducer. 4. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair. 5. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
Ans : 4 Exp :
The default partitioner computes a hash value for the key and assigns the partition based on this result. The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the number of partitions to determine which partition to send a given (key, value) pair to. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs to. The number of partitions is equal to the number of reduce tasks for the job.
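Roughly, the default behavior can be pictured as the partitioner sketched below (modeled on HashPartitioner; the Text/IntWritable generic types are just an example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key and take it modulo the number of reducers.
// The bitmask keeps the hash non-negative before the modulo.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}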
Question : You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
1. Increase the block size on all current files in HDFS. 2. Increase the block size on your remaining files. 3. Decrease the block size on your remaining files. 4. Increase the amount of memory for the NameNode. 5. Increase the number of disks (or size) for the NameNode.
Ans : 3
Exp :
*-put localSrc dest : Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
*What is the HDFS block size, and how is it different from the traditional file system block size? In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared directly with the traditional file system block size.
Question : In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
Ans : 1 Exp : A MapReduce job with m mappers and n reducers involves up to m * n distinct copy operations, since each mapper may have intermediate output going to every reducer.
Question : Workflows expressed in Oozie can contain:
1. Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins. 2. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins. 3. Access Mostly Uused Products by 50000+ Subscribers handlers but no forks. 4. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Ans : 1 Exp :
An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence in which the actions execute. This graph is specified in hPDL (an XML Process Definition Language). hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store: workflow definitions; and currently running workflow instances, including instance states and variables.
Question : Which best describes what the map method accepts and emits?
1. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output. 2. It accepts a single key-value pair as input and can emit only one key-value pair as output. 3. Access Mostly Uused Products by 50000+ Subscribers 4. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Ans : 4 Exp :
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Question : When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
1. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value, and when the reduce operation is both commutative and associative. 2. When the signature of the reduce method matches the signature of the combine method. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance. 5. Never. Combiners and reducers must be implemented separately because they serve different purposes.
Ans : 1 Exp : You can use your reducer code as a combiner if the operation performed is commutative and associative.
Question : You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
Ans : 1 Exp : So what should we do in order to deal with a huge number of images? Use Hadoop sequence files! These are map files that can inherently be read by MapReduce applications (there is an input format especially for sequence files) and are splittable by MapReduce, so we can have one huge file that will be the input of many map tasks.
By using those sequence files we are letting hadoop use its advantages. It can split the work into chunks so the processing is parallel, but the chunks are big enough that the process stays efficient.
Since the sequence files are map files, the desired format is for the key to be Text holding the HDFS filename, and the value to be BytesWritable containing the image content of the file.
(Reference: Hadoop binary files processing introduced by image duplicates finder.)
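A sketch of packing local image files into such a sequence file (assuming the Hadoop 2.x SequenceFile.Writer option API; the output path is hypothetical, and the local image paths are taken from the command line):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("/user/demo/images.seq");   // hypothetical output path

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (String localImage : args) {
        byte[] bytes = Files.readAllBytes(new File(localImage).toPath());
        writer.append(new Text(localImage),        // key: file name
                      new BytesWritable(bytes));   // value: raw image content
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}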
Question : You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
1. Run all the nodes in your production cluster as virtual machines on your development workstation 2. Run the hadoop command with the -jt local and the -fs file:/// options 3. Access Mostly Uused Products by 50000+ Subscribers 4. Run simldooop, the Apache open-source software for simulating Hadoop clusters Ans : 1 Exp : As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster.
Have enough RAM for the VM so that it does not swap. Don't try to run more than one VM per physical host; it will only make things slower. Use file: URLs to access persistent input and output data. Consider making the default filesystem a file: URL so that all storage is really on the physical host; it's often faster and preserves data better.
Question : In the standard word count MapReduce algorithm, why might using a combiner reduce the overall job running time?
1. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster. 2. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run. 3. Access Mostly Uused Products by 50000+ Subscribers reducers without writing the intermediate data to disk. 4. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers. Ans : 4
Question : If you run the word count MapReduce program with m mappers and r reducers, how many output files will you get at the end of the job? And how many key-value pairs will there be in each file? Assume k is the number of unique words in the input files.
1. There will be r files, each with exactly k/r key-value pairs. 2. There will be r files, each with approximately k/m key-value pairs. 3. Access Mostly Uused Products by 50000+ Subscribers 4. There will be m files, each with exactly k/m key-value pairs. 5. There will be m files, each with approximately k/m key-value pairs. Ans : 1