Question : Which best describes when the reduce method is first called in a MapReduce job?
1. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins. 2. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted. 3. Access Mostly Uused Products by 50000+ Subscribers optimal performance for map-only or reduce-only jobs. 4. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
To chain multiple MapReduce jobs, the driver for each job must create a new JobConf object and set its input path to the output path of the previous job.
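As a rough illustration (a minimal sketch using the old mapred API; the class name ChainedJobsDriver and the /data/... paths are hypothetical, and the per-step mapper/reducer setup is omitted), a chained driver might look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    // First job: reads the raw input and writes to an intermediate path.
    JobConf job1 = new JobConf(ChainedJobsDriver.class);
    job1.setJobName("step-1");
    // job1.setMapperClass(...); job1.setReducerClass(...);  // per-step classes omitted
    FileInputFormat.setInputPaths(job1, new Path("/data/input"));         // hypothetical path
    FileOutputFormat.setOutputPath(job1, new Path("/data/intermediate")); // hypothetical path
    JobClient.runJob(job1);  // blocks until the first job completes

    // Second job: a new JobConf whose input path is the first job's output path.
    JobConf job2 = new JobConf(ChainedJobsDriver.class);
    job2.setJobName("step-2");
    // job2.setMapperClass(...); job2.setReducerClass(...);  // per-step classes omitted
    FileInputFormat.setInputPaths(job2, new Path("/data/intermediate"));
    FileOutputFormat.setOutputPath(job2, new Path("/data/final"));
    JobClient.runJob(job2);
  }
}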
1. It is the newer version of MapReduce; using it, the performance of data processing can be increased. 2. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. 3. Access Mostly Uused Products by 50000+ Subscribers 4. All of the above 5. Only 2 and 3 are correct Ans : 5 Exp : MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs.
Question : Which of the following is a component of the ResourceManager?
Exp : The ResourceManager has two main components: the Scheduler and the ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status.
Question :
The Scheduler of the ResourceManager guarantees restarting of failed tasks, whether due to application failure or hardware failure. 1. True 2. False Ans : 2 Exp : The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status. Also, it offers no guarantees about restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, network, etc.
Question :
Which statement is true about the ApplicationsManager?
1. It is responsible for accepting job submissions 2. It negotiates the first container for executing the application-specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure 3. Access Mostly Uused Products by 50000+ Subscribers 4. All of the above 5. 1 and 2 are correct Ans : 5 Exp : The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
Question :
Which Java class represents a directory or file and provides its metadata, e.g. block size, replication factor, ownership, and permissions?
1. True 2. False Ans : 2 Exp : The NameNode does not store block locations persistently, since this information is reconstructed from the DataNodes when the system starts.
Question :
Which tool is used to list all the blocks of a file ?
Question : HDFS cannot store a file whose size is greater than the disk size of a single node:
1. True 2. False Ans : 2 Exp : It can, because the file is divided into blocks and the blocks can be stored on any nodes in the cluster.
Question : When is the earliest point at which the reduce method of a given Reducer can be called?
1. As soon as at least one mapper has finished processing its input split. 2. As soon as a mapper has emitted at least one record. 3. Not until all mappers have finished processing all records. 4. It depends on the InputFormat used for the job. Ans : 3 Exp : In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog reducers while they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.
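As a sketch of setting the slow-start threshold per job from the driver (assuming the Hadoop 2.x Job API; the job name is illustrative, and the property name is the one used in the explanation above, which newer releases spell mapreduce.job.reduce.slowstart.completedmaps):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Delay the reduce phase until 90% of the map tasks have completed.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);
    Job job = Job.getInstance(conf, "slow-start-example");
    // ... set mapper, reducer, input and output paths here ...
  }
}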
Question : Which describes how a client reads a file from HDFS?
1. The client queries the NameNode for the block locations. The NameNode returns the block locations to the client. The client reads the data directly off the DataNodes. 2. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode. 3. The client contacts the NameNode for the block locations. The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data blocks. The client then reads the data directly off the DataNode. 4. The client contacts the NameNode for the block locations. The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client. Ans : 1 Exp : Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
Question : You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
Question : How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
1. Keys are presented to a reducer in sorted order; values for a given key are not sorted. 2. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order. 3. Keys are presented to a reducer in random order; values for a given key are not sorted. 4. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order. Ans : 1 Explanation: A Reducer has 3 primary phases: 1. Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network. 2. Sort: The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged. SecondarySort: To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. 3. Reduce: In this phase the reduce(Object, Iterable, Context) method is called for each {key, (collection of values)} in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted. Reference: org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Question : You wrote a map function that throws a runtime exception when it encounters a control character in the input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4.
1. You will have forty-eight failed task attempts 2. You will have seventeen failed task attempts 3. Access Mostly Uused Products by 50000+ Subscribers 4. You will have twelve failed task attempts 5. You will have twenty failed task attempts Ans : 5 Exp : There will be four failed task attempts for each of the five file splits, since each map task fails on the first control character it encounters and is retried up to four times (5 x 4 = 20).
Question : You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
Question : You are running a job that will process a single InputSplit on a cluster which has no other jobs currently running. Each node has an equal number of open Map slots. On which node will Hadoop first attempt to run the Map task?
1. The node with the most memory 2. The node with the lowest system load 3. The node on which the InputSplit's data block is stored 4. The node with the most free local disk space Ans : 3 Exp : Hadoop attempts data-local processing: whenever possible, the JobTracker starts the task on the node where the data block is stored.
Question : Assuming default settings, which best describes the order of data provided to a reducer’s reduce method
1. The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are. 2. Both the keys and values passed to a reducer always appear in sorted order. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order Ans : 4 Exp :
Question : A MapReduce job processes millions of input records and generates the same number of key-value pairs (in the millions). The data is not uniformly distributed, so the MapReduce job is going to create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
1. Partitioner 2. OutputFormat 3. Access Mostly Uused Products by 50000+ Subscribers 4. Combiner Ans : 4 Exp : Using a combiner (which is often the same code as the reducer), we can reduce the amount of data transferred over the network during the shuffle step.
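For example, a combiner is registered from the driver roughly as in the fragment below (assuming the Hadoop 2.x Job API; TokenizerMapper and SumReducer are hypothetical classes defined elsewhere):

Job job = Job.getInstance(new Configuration(), "word-count");
job.setMapperClass(TokenizerMapper.class);   // hypothetical mapper
job.setCombinerClass(SumReducer.class);      // combiner: typically the same class as the reducer
job.setReducerClass(SumReducer.class);       // hypothetical reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

The combiner runs on the map side, so partial sums are computed locally and far fewer key-value pairs cross the network to the reducers.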
Question : Is joining two large datasets in MapReduce possible, like RDBMS tables sharing a key? Assume that the two datasets are CSV files stored in HDFS.
1. Yes 2. Yes, but only if one of the tables fits into memory 3. Access Mostly Uused Products by 50000+ Subscribers 4. No, MapReduce cannot perform relational operations. 5. No, but it can be done with either Pig or Hive.
Ans : 1 Exp : Yes, it is possible to join large datasets in Hadoop MapReduce; there are various approaches available for joining data in MapReduce, e.g. map-side joins, reduce-side joins, and joins using the DistributedCache.
Question : You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
1. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk 2. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS. 3. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Mapper. 4. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer 5. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Ans : 3 Exp : The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Question : You want to understand more about how users browse your public website named MapRPublic.com, such as which pages they visit prior to placing an order. You have a farm of 500 Tomcat and JBoss servers on the cloud (Microsoft Azure) hosting your website. How will you gather this data for your analysis?
1. Ingest the server web logs into HDFS using Flume. 2. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Channel these clickstreams into Hadoop using Hadoop Streaming. 5. Sample the weblogs from the web servers, copying them into Hadoop using curl.
Ans : 1 Exp : Flume is the tool used to ingest web server logs into HDFS. The parsing example below shows how the logs can be parsed once the data has been ingested into HDFS using Flume.
Hadoop MapReduce for Parsing Weblogs. Here are the steps for parsing a log file using Hadoop MapReduce: Load log files into HDFS using this Hadoop command: hadoop fs -put (local file path of weblogs) (HDFS destination path) The opencsv 2.3 library (Opencsv2.3.jar) is used for parsing log records. Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper extends Mapper<Object, Text, NullWritable, Text> {
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the log line on spaces, honoring quoted fields.
        CSVParser parse = new CSVParser(' ', '\"');
        String sp[] = parse.parseLine(value.toString());
        int spSize = sp.length;
        // Re-join the parsed fields as a comma-separated record.
        StringBuffer rec = new StringBuffer();
        for (int i = 0; i < spSize; i++) {
            rec.append(sp[i]);
            if (i != (spSize - 1)) rec.append(",");
        }
        word.set(rec.toString());
        context.write(NullWritable.get(), word);
    }
}
The command below runs the Hadoop-based log parser. The MapReduce program is attached in this article. You can add extra parsing methods in the class; be sure to create a new JAR with any change and move it to the Hadoop JobTracker system. hadoop jar (path of logparse jar) (HDFS logfile path) (output path of parsed log file) The output file is stored in the HDFS location, and the output file name starts with "part-".
Question : You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.
Ans : 3 Exp : Configure the property using the -D key=value notation: -D mapred.job.name='My Job'. You can list the available options by calling the streaming jar with just the -info argument. (Reference: Python Hadoop Streaming - setting a job name.)
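A minimal sketch of such a driver (the class name ExampleDriver is hypothetical and the job setup is omitted): ToolRunner parses generic options such as -D mapred.job.name=Example before run() is called, so they are already present in the Configuration returned by getConf().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value options parsed by ToolRunner.
    Job job = Job.getInstance(getConf());
    // ... set mapper, reducer, input and output paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // Typical invocation: hadoop jar example.jar ExampleDriver -D mapred.job.name=Example <in> <out>
    System.exit(ToolRunner.run(new Configuration(), new ExampleDriver(), args));
  }
}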
Question : You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.
1. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods 2. The data types specified in HADOOP_MAP_DATATYPES environment variable 3. Access Mostly Uused Products by 50000+ Subscribers 4. The InputFormat used by the job determines the mapper's input key and value types.
Ans : 4 Exp : The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
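As an illustration, a fragment of a driver declaring the types when the mapper and reducer outputs differ might look like the sketch below (this assumes job is an org.apache.hadoop.mapreduce.Job; the new-API Job class exposes methods with the same names as the JobConf methods mentioned above, and the concrete types are just an example):

// Mapper emits (Text, IntWritable); reducer emits (Text, LongWritable).
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);            // final (reducer) output key type
job.setOutputValueClass(LongWritable.class);  // final (reducer) output value type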
1. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous). 2. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair. 3. Access Mostly Uused Products by 50000+ Subscribers 4. One intermediate key-value pair, but of the same type. 5. As many intermediate key-value pairs as designed, as long as all the keys have the same type and all the values have the same type.
Ans : 5 Exp : Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
Question : You have the following key-value pairs as output from your Map task: (the, 1) (fox, 1) (faster, 1) (than, 1) (the, 1) (dog, 1) How many keys will be passed to the Reducer's reduce method?
Question : You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
Ans : 3 Exp : Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs. We use Pig scripts for sifting through the data and extracting useful information from the Web logs. We load the log file into Pig using the LOAD command. raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray); Note 1: Data Flow and Components *Content will be created by multiple Web servers and logged on local hard discs. This content will then be pushed to HDFS using the FLUME framework. FLUME has agents running on the Web servers; these are machines that collect data intermediately using collectors and finally push that data to HDFS. *Pig scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts actually analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementations for other repositories as well, such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig scripts can either push this data to HDFS, after which MR jobs will be required to read and push this data into HBase, or Pig scripts can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework's applicability for log analysis at large scale. *The HBase database will have the data processed by Pig scripts ready for reporting and further slicing and dicing. *The data-access Web service is a REST-based service that eases access and integration with data clients. The client can be written in any language that can access a REST-based API. These clients could be BI- or UI-based clients. Note 2: The Log Analysis Software Stack *Hadoop is an open source framework that allows users to process very large data in parallel. It's based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules: 1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster. 2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bundled with HDFS. *The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can store very large tables having millions of rows. It's a distributed database and can also keep multiple versions of a single row. *The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language. *Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing.
It's a data flow solution, where there is an originator and a destination for each node, and it is divided into Agent and Collector tiers for collecting logs and pushing them to the destination storage.
Question : What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
1. You will not be able to compress the intermediate data. 2. You will no longer be able to take advantage of a Combiner. 3. The output files will not be globally sorted, so you cannot simply concatenate them to produce a single sorted output file. 4. There are no concerns with this approach. It is always advisable to use multiple reducers.
Ans : 3 Exp : If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file; to do that you would need a total ordering of the keys across all output files.
Question : Given a directory of files with the following structure: line number, tab character, string. Example: 1<tab>filedata1, 2<tab>filedata2, 3<tab>filedata3. You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
Ans : 3 Exp : KeyValueTextInputFormat. TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character; for example, this is the output produced by TextOutputFormat, Hadoop's default output format. To interpret such files correctly, KeyValueTextInputFormat is appropriate. You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or key.value.separator.in.input.line in the old API); it is a tab character by default. Consider the following input file, where the space represents a horizontal tab character: line1 On the top of the Crumpetty Tree line2 The Quangle Wangle sat, line3 But his face you could not see, line4 On account of his Beaver Hat. As in the TextInputFormat case, the input is a single split comprising four records, although this time the keys are the Text sequences before the tab in each line: (line1, On the top of the Crumpetty Tree) (line2, The Quangle Wangle sat,) (line3, But his face you could not see,) (line4, On account of his Beaver Hat.) SequenceFileInputFormat: To use data from sequence files as the input to MapReduce, you use SequenceFileInputFormat. The keys and values are determined by the sequence file, and you need to make sure that your map input types correspond to them.
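A small driver fragment using the old mapred API referenced in the question (the driver class MyDriver is hypothetical; tab is already the default separator and is shown only for clarity):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

JobConf conf = new JobConf(MyDriver.class);          // MyDriver is hypothetical
conf.set("key.value.separator.in.input.line", "\t"); // old-API property name
conf.setInputFormat(KeyValueTextInputFormat.class);
// The mapper now receives Text keys (text before the tab) and Text values (text after it).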
Question : For each intermediate key, each reducer task can emit:
1. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous). 2. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs. 3. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type. 4. One final key-value pair per value associated with the key; no restrictions on the type. 5. One final key-value pair per key; no restrictions on the type.
Ans : 3 Exp :Reducer reduces a set of intermediate values which share a key to a smaller set of values. Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Question : What data does a Reducer reduce method process?
Ans : 3 Exp :Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. All values with the same key are presented to a single reduce task.
Question : All keys used for intermediate output from mappers must:
Ans : 3 Exp :The MapReduce framework operates exclusively on (key, value) pairs, that is, the framework views the input to the job as a set of (key, value) pairs and produces a set of (key, value) pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Question : On a cluster running MapReduce v (MRv), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?
1. The amount of RAM installed on the TaskTracker node. 2. The amount of free disk space on the TaskTracker node. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The average system load on the TaskTracker node over the past fifteen (15) minutes. 5. The location of the InputSplit to be processed in relation to the location of the node.
Ans : 5 Exp : The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
Question : A client application creates an HDFS file named foo.txt with a replication factor of three. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C.
1. The file will be marked as corrupted if data node B fails during the creation of the file. 2. Each data node locks the local file to prohibit concurrent readers and writers of the file. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Each data node stores a copy of the file in the local file system with the same name as the HDFS file. 5. The file can be accessed if at least one of the data nodes storing the file is available.
Ans : 5 Exp : HDFS keeps three copies of a block on three different DataNodes to protect against true data corruption. HDFS also tries to distribute these three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed DataNode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes are sufficient to avoid corrupted files. Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on DataNodes on the same rack and the 3rd copy on a different rack.
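As a sketch of setting the replication factor programmatically from a client (the HDFS path is hypothetical; FileSystem.create has an overload that accepts a replication factor, and setReplication changes it later):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/foo.txt");   // hypothetical path

    // Create the file with an explicit replication factor of 3.
    FSDataOutputStream out = fs.create(file, (short) 3);
    out.writeUTF("example content");
    out.close();

    // The replication factor can also be changed after creation.
    fs.setReplication(file, (short) 2);
  }
}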
Question : Which process describes the lifecycle of a Mapper?
1. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method. 2. The TaskTracker spawns a new Mapper to process all records in a single input split. 3. Access Mostly Uused Products by 50000+ Subscribers 4. The JobTracker spawns a new Mapper to process all records in a single file.
Ans : 3 Exp : For each map instance that runs, the TaskTracker creates a new instance of your mapper. Note: *The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of extraction and transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different, Key/Value type. *With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out. Examining the run() method, we can see the lifecycle of the mapper:
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method. map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value). cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Question : To process input key-value pairs, your mapper needs to load an MB-sized data file into memory. What is the best way to accomplish this?
1. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper. 2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
Ans : 4 Exp : Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally. Use case: let's understand our use case in a bit more detail so that we can follow the code snippet. We have a key-value file that we need to use in our map jobs. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value. So what we need is: a key-value file (let's use a Properties file) and the Mapper code that uses it.
public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Properties cache;

    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(value)
        // do some action or replace with something else
    }
}
Note: * Distribute application-specific large, read-only files efficiently. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.
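On the driver side, the file is registered with the cache roughly as in the fragment below (old mapred API; MyDriver and the HDFS path are hypothetical):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);                       // MyDriver is hypothetical
// The file must already exist on HDFS at this (hypothetical) path.
DistributedCache.addCacheFile(new URI("/user/demo/lookup.properties"), conf);
// Each TaskTracker copies the file locally before the mapper's setup()/configure()
// method runs, where it can be read into memory as shown above.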
Question : In a MapReduce job, the reducer receives all values associated with same key. Which statement best describes the ordering of these values?
1. The values are in sorted order. 2. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values. Ans : 2 Exp :*Input to the Reducer is the sorted output of the mappers. *The framework calls the application's Reduce function once for each unique key in the sorted order. *Example: For the given sample input the first map emits: (Hello, 1) ( World, 1) ( Bye, 1) ( World, 1) The second map emits: ( Hello, 1) ( Hadoop, 1) ( Goodbye, 1) ( Hadoop, 1)
Question : You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
Question : You want to count the number of occurrences of each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing reducer as your combiner in this case, and why or why not?
1. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match. 2. No, because the sum operation in the reducer is incompatible with the operation of a Combiner. 3. Access Mostly Uused Products by 50000+ Subscribers 4. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value. 5. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.
Ans : 1 Exp : Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
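For example, a word-count sum reducer like the sketch below can safely double as a combiner, because summation is commutative and associative and its input and output types match (this is an illustrative class, not taken from any particular question above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Can be registered with both job.setReducerClass(...) and job.setCombinerClass(...).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();   // partial sums from the combiner add up to the same total
    }
    result.set(sum);
    context.write(key, result);
  }
}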
Question : Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
Ans : 4 Exp : The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop wiki): Client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data. The JobTracker locates TaskTracker nodes with available slots at or near the data. The JobTracker submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
Question : Which project gives you a distributed, Scalable, data store that allows you random, real-time read/write access to hundreds of terabytes of data?
Ans : 1 Exp : Use Apache HBase when you need random, realtime read/write access to your Big Data.
HBase's goal is the hosting of very large tables - billions of rows x millions of columns - atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features of HBase: - Linear and modular scalability. - Strictly consistent reads and writes. - Automatic and configurable sharding of tables. - Automatic failover support between RegionServers. - Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables. - Easy to use Java API for client access. - Block cache and Bloom filters for real-time queries. - Query predicate push down via server-side filters. - Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options. - Extensible JRuby-based (JIRB) shell. - Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
Question : Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
Exp :Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: Imports individual tables or entire databases to files in HDFS Generates Java classes to allow you to interact with your imported data Provides the ability to import from SQL databases straight into your Hive data warehouse
Data Movement Between Hadoop and Relational Databases Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function. Note:
* Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Question : You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it is given a path object representing this directory?
1. Four, all files will be processed 2. Three, the pound sign is an invalid character for HDFS file names 3. Two, file names that begin with an underscore or a period are ignored 4. None, the directory cannot be named jobdata 5. One, no special characters can prefix the name of an input file
Ans : 3 Exp : Files starting with '_' are considered 'hidden', like Unix files starting with '.'. The '#' character is allowed in HDFS file names.
Question : You write a MapReduce job to process files in HDFS. Your MapReduce algorithm uses TextInputFormat; the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
1. There is no difference in output between the two settings. 2. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS. 3. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS. 4. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Ans : 4 Exp : It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem. *Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks; rather, the outputs of the mapper tasks will be the final output of the job.
Reduce: In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each (key, (list of values)) pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
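A minimal driver fragment showing the map-only configuration under these assumptions (old JobConf API to match the text above; MyMapOnlyDriver is a hypothetical driver class):

JobConf conf = new JobConf(MyMapOnlyDriver.class);  // hypothetical driver class
conf.setNumReduceTasks(0);   // zero reducers: map output is written directly to HDFS, unsorted
// conf.setNumReduceTasks(1); // one reducer instead: all matches end up in a single sorted output file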
Question : In a MapReduce job with map tasks, how many map task attempts will there be?
A task attempt is a particular instance of an attempt to execute a task.
There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution can also result in more task attempts than completed tasks.
Question : What types of algorithms are difficult to express in MapReduce v (MRv)?
1. Algorithms that require applying the same mathematical function to large numbers of individual binary records. 2. Relational operations on large amounts of structured and semi-structured data. 3. Algorithms that require global, shared state. 4. Large-scale graph algorithms that require one-step link traversal. 5. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Ans : 3 Exp :
Limitations of MapReduce: where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not well suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values. If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the summation of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map-reduce process.
2. Full-text indexing or ad hoc searching. The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms that depend on shared global state. Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).
Question : In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
1. It returns a reference to a different Writable object each time. 2. It returns a reference to a Writable object from an object pool. 3. It returns a reference to the same Writable object each time, but populated with different data. 4. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object. 5. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Ans : 3 Exp : Calling Iterator.next() will always return the same exact instance of IntWritable, with the contents of that instance replaced with the next value. (Reference: manipulating the iterator in MapReduce.)
Question : Analyze each scenario below and identify which best describes the behavior of the default partitioner?
1. The default partitioner assigns key-value pairs to reducers based on an internal random number generator. 2. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space. 3. Access Mostly Uused Products by 50000+ Subscribers with different buckets, and each bucket is assigned to a specific reducer. 4. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair. 5. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
Ans : 4 Exp :
The default partitioner computes a hash value for the key and assigns the partition based on this result. The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the number of partitions to determine which partition to send a given (key, value) pair to. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs to. The number of partitions is equal to the number of reduce tasks for the job.
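Roughly, the default behavior can be pictured as the partitioner sketched below (modeled on HashPartitioner; the Text/IntWritable generic types are just an example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key and take it modulo the number of reducers.
// The bitmask keeps the hash non-negative before the modulo.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}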
Question : You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
1. Increase the block size on all current files in HDFS. 2. Increase the block size on your remaining files. 3. Decrease the block size on your remaining files. 4. Increase the amount of memory for the NameNode. 5. Increase the number of disks (or size) for the NameNode.
Ans : 3
Exp :
*-put localSrc dest : Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
*What is the HDFS block size, and how is it different from the traditional file system block size? In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared directly with the traditional file system block size.
Question : In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
Ans : 1 Exp : A MapReduce job with m mappers and n reducers involves up to m * n distinct copy operations, since each mapper may have intermediate output going to every reducer.
Question : Workflows expressed in Oozie can contain:
1. Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins. 2. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins. 3. Access Mostly Uused Products by 50000+ Subscribers handlers but no forks. 4. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Ans : 1 Exp :
An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence in which the actions execute. This graph is specified in hPDL (an XML Process Definition Language). hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store: workflow definitions; and currently running workflow instances, including instance states and variables.
Question : Which best describes what the map method accepts and emits?
1. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output. 2. It accepts a single key-value pair as input and can emit only one key-value pair as output. 3. Access Mostly Uused Products by 50000+ Subscribers 4. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Ans : 4 Exp :
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Question : When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
1. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value, and when the reduce operation is both commutative and associative. 2. When the signature of the reduce method matches the signature of the combine method. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance. 5. Never. Combiners and reducers must be implemented separately because they serve different purposes.
Ans : 1 Exp : You can use your reducer code as a combiner if the operation performed is commutative and associative.
Question : You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
Ans : 1 Exp : So what should we do in order to deal with a huge number of images? Use Hadoop sequence files! These are map files that can inherently be read by MapReduce applications (there is an input format especially for sequence files) and are splittable by MapReduce, so we can have one huge file that will be the input of many map tasks.
By using those sequence files we are letting hadoop use its advantages. It can split the work into chunks so the processing is parallel, but the chunks are big enough that the process stays efficient.
Since the sequence files are map files, the desired format is for the key to be Text holding the HDFS filename, and the value to be BytesWritable containing the image content of the file.
(Reference: Hadoop binary files processing introduced by image duplicates finder.)
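A sketch of packing local image files into such a sequence file (assuming the Hadoop 2.x SequenceFile.Writer option API; the output path is hypothetical, and the local image paths are taken from the command line):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("/user/demo/images.seq");   // hypothetical output path

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (String localImage : args) {
        byte[] bytes = Files.readAllBytes(new File(localImage).toPath());
        writer.append(new Text(localImage),        // key: file name
                      new BytesWritable(bytes));   // value: raw image content
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}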
Question : You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
1. Run all the nodes in your production cluster as virtual machines on your development workstation 2. Run the hadoop command with the -jt local and the -fs file:/// options 3. Access Mostly Uused Products by 50000+ Subscribers 4. Run simldooop, the Apache open-source software for simulating Hadoop clusters Ans : 1 Exp : As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster.
Have enough RAM for the VM so that it does not swap. Don't try to run more than one VM per physical host; it will only make things slower. Use file: URLs to access persistent input and output data. Consider making the default filesystem a file: URL so that all storage is really on the physical host; it's often faster and preserves data better.
Question : In the standard word count MapReduce algorithm, why might using a combiner reduce the overall job running time?
1. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster. 2. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run. 3. Access Mostly Uused Products by 50000+ Subscribers reducers without writing the intermediate data to disk. 4. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers. Ans : 4
Question : If you run the word count MapReduce program with m mappers and r reducers, how many output files will you get at the end of the job? And how many key-value pairs will there be in each file? Assume k is the number of unique words in the input files.
1. There will be r files, each with exactly k/r key-value pairs. 2. There will be r files, each with approximately k/m key-value pairs. 3. Access Mostly Uused Products by 50000+ Subscribers 4. There will be m files, each with exactly k/m key-value pairs. 5. There will be m files, each with approximately k/m key-value pairs. Ans : 1