
IBM Certified Data Architect - Big Data Certification Questions and Answers (Dumps and Practice Questions)



Question : Your cluster has nine slave nodes. The cluster's block size is set to 128 MB, and its replication factor is set to 3.

How will the Hadoop framework distribute block writes from a MapReduce Reducer into HDFS when that MapReduce Reducer outputs a 300MB file?
1. Reducers don't write blocks into HDFS.
2. The nine blocks will be written to three nodes, such that each of the three gets one copy of each block.
3. Access Mostly Uused Products by 50000+ Subscribers
4. The nine blocks will be written randomly to the nodes; some may receive multiple blocks, some may receive none.
5. The node on which the Reducer is running will receive one copy of each block. The other replicas will be placed on other nodes in the cluster.



Correct Answer : 5

Explanation: In our example, with a block size of 128MB and a replication factor of 3, when the Reducer writes the file it will be split into three blocks (a 128MB block, another 128MB
block, and a 44MB block). Each block will be replicated three times. For efficiency, if a client (in this case the Reduce task) is running in a YARN container on a cluster node,
the first replica of each block it creates will be sent to the DataNode daemon running on that same node. The other two replicas will be written to DataNodes on other machines in
the cluster.
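For illustration, here is a minimal Java sketch of a client writing to HDFS with an explicit 128 MB block size and a replication factor of 3; the output path and cluster configuration are assumptions, not part of the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/example/part-r-00000");  // hypothetical reducer output path
        // Ask for a 128 MB block size and replication factor 3; HDFS splits the stream into blocks as it is written.
        FSDataOutputStream stream = fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024);
        stream.writeBytes("example record\n");
        stream.close();                                      // the last (partial) block is finalized on close
        fs.close();
    }
}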






Question : A client application writes a file to HDFS on your cluster. Which two metadata changes occur?

1. The metadata in RAM on the NameNode is updated
2. The change is written to the Secondary NameNode
3. The change is written to the fsimage file
4. The change is written to the edits file
5. The NameNode triggers a block report to update block locations in the edits file
6. The metadata in RAM on the NameNode is flushed to disk

1. 1,3,4
2. 1,4,6
3. 1,4
4. 5,6
5. 4,6
Ans : 3
Exp : The namenode stores its filesystem metadata on local filesystem disks in a few different files, the two most important of which are fsimage and edits. Just as a
database would, fsimage contains a complete snapshot of the filesystem metadata, whereas edits contains only incremental modifications made to the metadata. As is common
practice for high-throughput data stores, use of a write-ahead log (WAL) such as the edits file reduces I/O to sequential, append-only operations (in the context of the
namenode, which serves reads directly from RAM), avoiding costly seek operations and yielding better overall performance. Upon namenode startup, the fsimage file is loaded into
RAM and any changes in the edits file are replayed, bringing the in-memory view of the filesystem up to date.


The NameNode metadata contains information about every file stored in HDFS. The NameNode holds the metadata in RAM for fast access, so any change is reflected in that RAM version.
However, this is not sufficient for reliability, since if the NameNode crashes information on the change would be lost. For that reason, the change is also written to a log file
known as the edits file.


In more recent versions of Hadoop (specifically, Apache Hadoop 2.0 and CDH4), the underlying metadata storage was updated to be more resilient to corruption and to support
namenode high availability. Conceptually, metadata storage is similar, although transactions are no longer stored in a single edits file. Instead, the namenode periodically rolls
the edits file (closes one file and opens a new file), numbering them by transaction ID. It's also possible for the namenode to now retain old copies of both fsimage and edits to
better support the ability to roll back in time. Most of these changes won't impact you, although it helps to understand the purpose of the files on disk. That being said, you
should never make direct changes to these files unless you really know what you are doing. The rest of this book will simply refer to these files using their base names, fsimage
and edits, to refer generally to their function. Recall from earlier that the namenode writes changes only to its write-ahead log, edits. Over time the edits file grows and,
as with any log-based system, would take a long time to replay in the event of server failure. Similar to a relational database, the edits file needs to be periodically
applied to the fsimage file. The problem is that the namenode may not have the available resources (CPU or RAM) to do this while continuing to provide service to the
cluster. This is where the secondary namenode comes in.
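As a rough sketch, assuming a standard Hadoop 2 configuration is on the classpath, the directories holding fsimage and edits can be read from the stock HDFS properties dfs.namenode.name.dir and dfs.namenode.edits.dir:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class MetadataDirsSketch {
    public static void main(String[] args) {
        Configuration conf = new HdfsConfiguration();  // loads hdfs-default.xml and hdfs-site.xml
        // Directory (or directories) holding the fsimage checkpoints
        System.out.println("fsimage dirs: " + conf.get("dfs.namenode.name.dir"));
        // Directory holding the edits write-ahead log; by default it is the same location as the name dir
        System.out.println("edits dirs:   " + conf.get("dfs.namenode.edits.dir"));
    }
}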






Question : What does each block of a file contain when it is written into HDFS?
1. Each block writes a separate .meta file containing information on the filename of which the block is a part
2. Each block contains only data from the file
3. Access Mostly Uused Products by 50000+ Subscribers
4. Each block has a header and a footer containing metadata
Ans: 2
Exp : When a file is written into HDFS, it is split into blocks. Each block contains just a portion of the file; there is no extra data in the block file. Although there is
also a .meta file associated with each block, that file contains checksum data which is used to confirm the integrity of the block when it is read. Nothing on the DataNode
contains information about what file the block is a part of; that information is held only on the NameNode.
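To see the checksum side of this, a minimal Java sketch (with a hypothetical file path) asks HDFS for the checksum it derives from the per-block .meta files, while the block files themselves hold only raw file data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockChecksumSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/weblogs");   // hypothetical file
        // HDFS combines the per-block checksums (stored in the .meta files on the DataNodes)
        // into a single file-level checksum used to verify integrity on read.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(checksum.getAlgorithmName() + " : " + checksum);
        fs.close();
    }
}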


Question : Which statement is true with respect to MapReduce 2.0 (MRv2) or YARN?
1. It is the newer version of MapReduce; using it, the performance of data processing can be increased.
2. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling or monitoring, into separate daemons.
3. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
4. All of the above
5. Only 2 and 3 are correct
Ans : 5
Exp : MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN.
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling or monitoring, into separate daemons. The idea is to have a global ResourceManager (RM)
and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
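For context, a minimal sketch of submitting an application to MRv2 with the standard Job API; on a YARN cluster this single job becomes one application managed by its own ApplicationMaster. The class name and paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Mrv2SubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word-count-sketch");   // one YARN application
        job.setJarByClass(Mrv2SubmitSketch.class);
        // Mapper/Reducer classes would be set here in a real job, e.g. job.setMapperClass(...)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));     // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        // The ResourceManager schedules containers; a per-job ApplicationMaster tracks progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}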


Question : Which is a component of the ResourceManager?

1. Scheduler
2. Applications Manager
3. Access Mostly Uused Products by 50000+ Subscribers
4. All of the above
5. Only 1 and 2 are correct
Ans : 5

Exp : The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as capacities and
queues. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status.
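As a rough illustration of talking to these two components from client code, the YARN client API can ask the ResourceManager about running applications and cluster nodes; the ResourceManager address is assumed to come from yarn-site.xml.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceManagerQuerySketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the RM address
        yarn.start();
        // Applications accepted by the ApplicationsManager component of the RM
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        // Nodes whose resources the Scheduler hands out as containers
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }
        yarn.stop();
    }
}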


Question :

The Scheduler of the ResourceManager offers guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.
1. True
2. False
Ans : 2
Exp : The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as
capacities and queues. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status.
Also, it offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.
The Scheduler performs its scheduling function based on the resource requirements of the applications;
it does so based on the abstract notion of a resource Container, which incorporates elements such as memory,
CPU, disk and network.


Question :

Which statement is true about the ApplicationsManager?

1. It is responsible for accepting job submissions
2. It negotiates the first container for executing the application-specific ApplicationMaster
and provides the service for restarting the ApplicationMaster container on failure.
3. Access Mostly Uused Products by 50000+ Subscribers
4. All of the above
5. 1 and 2 are correct
Ans : 5
Exp : The ApplicationsManager is responsible for accepting job submissions,
negotiating the first container for executing the application-specific ApplicationMaster, and providing the
service for restarting the ApplicationMaster container on failure.


Question :

Does the NameNode store block locations persistently?

1. True
2. False
Ans : 2
Exp : The NameNode does not store block locations persistently, since this information is reconstructed from the DataNodes when the system starts.
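To illustrate, a client can ask the NameNode for the in-memory block locations of a file through the FileSystem API; the file path below is a hypothetical example, and this is the same information that the hadoop fsck tool reports.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/example/weblogs"));  // hypothetical file
        // The NameNode answers from RAM; it learned these locations from DataNode block reports.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}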


Question :

Which tool is used to list all the blocks of a file?

1. hadoop fs
2. hadoop fsck
3. Access Mostly Uused Products by 50000+ Subscribers
4. Not Possible
Ans : 2



Question : HDFS cannot store a file whose size is greater than a single node's disk size:



1. True
2. False
Ans : 2
Exp : It can, because the file is divided into blocks, and the blocks can be stored on any nodes in the cluster.



Question : After a file has been written to HDFS, which of the following operations can you perform?

1. You can move the file
2. You can overwrite the file by creating a new file with the same name
3. You can rename the file
4. You can update the file's contents
5. You can delete the file
1. 1,2,3
2. 1,3,4
3. 1,3,5
4. 2,4,5
5. 2,3,4

Correct Answer : 3

Explanation: HDFS is a write-once filesystem; after a file has been written to HDFS it cannot be modified. In some versions of HDFS you can append to the file, but you can never modify the existing contents. Files can be moved, deleted, or renamed, as these are metadata operations. You cannot overwrite a file with an existing name.
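A short Java sketch of the operations described above, using hypothetical paths: rename, move and delete succeed because they only touch NameNode metadata, while creating a file over an existing name without the overwrite flag fails.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path original = new Path("/user/example/report.txt");     // hypothetical existing file
        Path renamed  = new Path("/archive/report-2015.txt");

        fs.rename(original, renamed);        // allowed: move/rename is a metadata-only operation
        try {
            fs.create(renamed, false);       // overwrite=false: fails because the name already exists
        } catch (java.io.IOException expected) {
            System.out.println("Cannot overwrite an existing file: " + expected.getMessage());
        }
        fs.delete(renamed, false);           // allowed: delete (non-recursive)
        fs.close();
    }
}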






Question : You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the
Mapper's map method?

1. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk
2. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
3. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Mapper
4. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
5. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

Ans : 3
Exp : The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual
mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop
administrator. The intermediate data is cleaned up after the Hadoop job completes.
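For reference, a small sketch, assuming an MRv1-era cluster to match the TaskTracker wording, that reads the two relevant settings: mapred.local.dir, the local temporary directories that hold spilled map output, and io.sort.mb, the size of the in-memory sort buffer that spills to them. The stated defaults are assumptions about stock MRv1 values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class SpillConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new JobConf();   // loads mapred-site.xml in addition to core defaults
        // Local (non-HDFS) directories on each TaskTracker node where map output is spilled
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
        // In-memory buffer for map output; when it fills past a threshold, it spills to mapred.local.dir
        System.out.println("io.sort.mb       = " + conf.get("io.sort.mb", "100"));
    }
}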


Question : Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and
monitoring application resource usage?

1. ResourceManager
2. NodeManager
3. Access Mostly Uused Products by 50000+ Subscribers
4. ApplicationMasterService
5. TaskTracker.

Ans : 3
Exp : The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource
management and job scheduling/monitoring, into separate daemons. The idea is to have a global
ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the
classical sense of Map-Reduce jobs or a DAG of jobs.




Question : You have user profile records in your OLTP database that you want to join with web logs you have already
ingested into the Hadoop file system. How will you obtain these user records?

1. HDFS command
2. Pig LOAD command
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hive LOAD DATA command
5. Ingest with Flume agents

Ans : 3
Exp : Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs.
We use Pig scripts for sifting through the data and to extract useful information from the Web logs.
We load the log file into Pig using the LOAD command.
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
Note 1: Data Flow and Components
* Content will be created by multiple Web servers and logged to local hard discs. This content is then pushed to HDFS using the FLUME framework.
FLUME has agents running on the Web servers; these are machines that collect data intermediately using collectors and finally push that data to HDFS.
* Pig scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts analyze
the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementations
for other repositories as well, such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig
scripts can either push this data to HDFS, with MR jobs then required to read and push the data into HBase, or Pig scripts can push the data into HBase directly. In this
article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework's applicability for log analysis at large scale.
* The database HBase will hold the data processed by Pig scripts, ready for reporting and further slicing and dicing.
* The data-access Web service is a REST-based service that eases access and integration with data clients. The client can be in any language that can access a
REST-based API. These clients could be BI- or UI-based clients.
Note 2: The Log Analysis Software Stack
* Hadoop is an open source framework that allows users to process very large data in parallel. It's based on the framework that supports the Google search engine. The Hadoop core is
mainly divided into two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bonded with HDFS.
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, since we can
keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives
real-time, random read/write access to very large data sets; HBase can save very large tables having millions of rows. It's a distributed database and can also keep multiple
versions of a single row.
* The Pig framework is an open source platform for analyzing large data sets, implemented as a layered language over the Hadoop Map-Reduce framework.
It is built to ease the work of developers who would otherwise write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast,
Pig enables users to write code in a scripting language.
* Flume is a distributed, reliable and available service for collecting, aggregating and moving a large amount of log data (src flume-wiki). It
was built to push large logs into Hadoop-HDFS for further processing. It's a data flow solution, where there is an originator and destination for each node, and it is divided into
Agent and Collector tiers for collecting logs and pushing them to destination storage.



Question : A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best
describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?

1. The file will be marked as corrupted if data node B fails during the creation of the file.
2. Each data node locks the local file to prohibit concurrent readers and writers of the file.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
5. The file can be accessed if at least one of the data nodes storing the file is available.

Ans : 5
Exp : HDFS keeps three copies of a block on three different datanodes to protect against data corruption.
HDFS also tries to distribute these three replicas across more than one rack to protect against data availability
issues. The fact that HDFS actively monitors any failed datanode(s) and, upon failure detection, immediately
schedules re-replication of blocks (if needed) means that three copies of data on three different nodes are
sufficient to avoid corrupted files.
Note:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a
sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An application can
specify the number of replicas of a file. The replication factor can be specified at file creation time and can be
changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all
decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default
configuration there are a total of 3 copies of each data block in HDFS; 2 copies are stored on datanodes in the same rack
and the 3rd copy on a different rack.
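For illustration, a sketch with a hypothetical path showing how the per-file replication factor mentioned above can be read and changed through the FileSystem API after the file has been created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/foo.txt");            // hypothetical file
        short current = fs.getFileStatus(file).getReplication();  // replication set at creation time
        System.out.println("current replication = " + current);
        // Raise the target replication; the NameNode schedules the extra copies asynchronously.
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}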




Question : Which project gives you a distributed, scalable data store that allows you random, real-time read/write access
to hundreds of terabytes of data?

1. HBase
2. Hue
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hive
5. Oozie

Ans : 1
Exp : Use Apache HBase when you need random, realtime read/write access to your Big Data.

The goal of HBase is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable:
A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.

Features of HBase:
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
- Extensible jruby-based (JIRB) shell
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
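A minimal sketch of the random, real-time read/write access described above, using the standard HBase Java client; the table and column family names are assumptions for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("weblogs"))) {   // hypothetical table
            // Random write: one cell keyed by row, column family and qualifier
            Put put = new Put(Bytes.toBytes("user42#2015-06-01"));
            put.addColumn(Bytes.toBytes("hits"), Bytes.toBytes("count"), Bytes.toBytes("17"));
            table.put(put);
            // Random read of the same row, served in real time by the RegionServer
            Result result = table.get(new Get(Bytes.toBytes("user42#2015-06-01")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("hits"), Bytes.toBytes("count"))));
        }
    }
}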





Question : Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and
generate Java classes to interact with that imported data?

1. Oozie
2. Flume
3. Access Mostly Uused Products by 50000+ Subscribers
4. Hue
5. Sqoop

Ans : 5

Exp : Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse


Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can
be accessed from within a MapReduce map function.
Note:

* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual
tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with
the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a
connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
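Sqoop is normally driven from the command line, but as a hedged illustration the same daily import can be launched from Java by handing the usual arguments to Sqoop's tool runner; the JDBC URL, table name and target directory below are placeholders.

import org.apache.sqoop.Sqoop;

public class DailyImportSketch {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table user_profiles --target-dir ...
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/crm",        // hypothetical source database
            "--username", "etl",
            "--table", "user_profiles",
            "--target-dir", "/data/user_profiles/2015-06-01",
            "--num-mappers", "4"
        };
        // runTool parses the arguments and runs the import as MapReduce jobs;
        // Sqoop also generates a Java class for the user_profiles records as part of the import.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}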




Question : Table metadata in Hive is:

1. Stored as metadata on the NameNode.
2. Stored along with the data in HDFS.
3. Stored in the Metastore.
4. Stored in ZooKeeper.


Ans : 3
Exp :

By default, Hive uses an embedded Derby database to store metadata information. The metastore is the "glue"
between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what
tables they belong to, etc.

The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called
DataNucleus to convert object representations into a relational schema and vice versa. This approach was chosen,
as opposed to storing the information in HDFS, because the Metastore needs to be very low latency. The
DataNucleus layer allows many different RDBMS technologies to be plugged in.

*By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases
like MySQL can optionally be used.

*Features of Hive include:
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query
execution.

Store Hive Metadata into RDBMS
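As an illustration, the metadata that the Metastore keeps for a table can be inspected from Java over Hive's JDBC interface; the HiveServer2 URL and table name are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveMetadataSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; the metadata itself lives in the RDBMS-backed Metastore
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("DESCRIBE FORMATTED weblogs")) {
            while (rs.next()) {
                // Columns, HDFS location, SerDe and other properties the Metastore tracks for the table
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}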




Question : Someone in your data center unplugs a slave node by accident. Users of the cluster notice via the Hadoop Web UI
that the cluster size has shrunk and express concerns about data loss and HDFS performance.
The replication factor of all the files in the cluster is unchanged from the default of 3. What can you tell the users?
1. The NameNode will re-replicate the data after the administrator issues a special command.
The data is not lost but is under-replicated until the administrator issues this command.
2. After identifying the outage, the NameNode will naturally re-replicate the data and there will be no data loss.
The administrator can re-add the DataNode at any time. The client can disregard warnings concerned with this event.
Data will be under-replicated but will become properly replicated over time.
3. Access Mostly Uused Products by 50000+ Subscribers
failures to the DataNode, so the end users can disregard such warnings.
4. The HDFS filesystem is corrupt until the administrator re-adds the DataNode to the cluster.
The warnings associated with the event should be reported.
Ans : 2
Exp : HDFS is designed to deal with the loss of DataNodes automatically. Each DataNode heartbeats in to the NameNode every three seconds. If the NameNode does not receive a
heartbeat from the DataNode for a certain amount of time (10 minutes and 30 seconds by default) it begins to re-replicate any blocks which were on the now-dead DataNode. This
re-replication requires no manual input by the system administrator

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode
detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any
data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value.
The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a
DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
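The 10 minutes 30 seconds figure comes from two configuration properties; the sketch below computes the same value, assuming stock Hadoop 2 defaults, and the 2 * recheck + 10 * heartbeat formula is stated here as an assumption about the default behavior.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class DeadNodeTimeoutSketch {
    public static void main(String[] args) {
        Configuration conf = new HdfsConfiguration();
        // DataNode heartbeat interval, in seconds (default 3)
        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        // NameNode re-check interval for stale DataNodes, in milliseconds (default 300000 = 5 minutes)
        long recheckMillis = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000);
        // The NameNode marks a DataNode dead after roughly 2 * recheck + 10 * heartbeat
        long timeoutMillis = 2 * recheckMillis + 10 * heartbeatSeconds * 1000;
        System.out.println("DataNode considered dead after " + (timeoutMillis / 1000) + " seconds");
        // With defaults: 2*300s + 10*3s = 630 seconds = 10 minutes 30 seconds
    }
}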





Question : You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you
have ample space on your DataNodes. Which action should you take to relieve this situation and store more
files in HDFS?

1. Increase the block size on all current files in HDFS.
2. Increase the block size on your remaining files.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Increase the amount of memory for the NameNode.
5. Increase the number of disks (or size) for the NameNode.

Ans : 3

Exp :

* -put localSrc dest : Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

* What is the HDFS block size, and how is it different from the traditional file system block size? In HDFS, data is split into
blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size.
Each block is replicated multiple times; the default is to replicate each block three times, with replicas stored on
different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block
size cannot be compared with the traditional file system block size.
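For completeness, a hedged Java sketch of the programmatic equivalent of the -put command above, using hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Same effect as: hadoop fs -put /var/log/weblogs /data/weblogs
        fs.copyFromLocalFile(new Path("/var/log/weblogs"),   // local source
                             new Path("/data/weblogs"));     // HDFS destination
        fs.close();
    }
}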



Question : Workflows expressed in Oozie can contain:

1. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks,
decision points, and path joins.
2. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be
combined with forks and path joins.
3. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception
handlers but no forks.
4. Iterative repetition of MapReduce jobs until a desired answer or state is reached.


Ans : 1
Exp :

An Oozie workflow is a collection of actions (i.e., Hadoop Map/Reduce jobs, Pig jobs) arranged in a control
dependency DAG (Directed Acyclic Graph), specifying a sequence in which actions execute. This graph is specified
in hPDL (an XML Process Definition Language). hPDL is a fairly compact language, using a limited amount of
flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a
workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork
and join nodes).

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
- Workflow definitions
- Currently running workflow instances, including instance states and variables
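As a sketch of how a client interacts with that web application, Oozie's Java client can submit a workflow whose hPDL definition already sits in HDFS; the Oozie URL, application path and tracker settings below are illustrative assumptions.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");  // hypothetical server
        Properties props = oozie.createConfiguration();
        // HDFS directory containing workflow.xml (the hPDL definition with its action/control nodes)
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/weblog-workflow");
        props.setProperty("nameNode", "hdfs://namenode:8020");
        props.setProperty("jobTracker", "jobtracker:8021");
        String jobId = oozie.run(props);                       // stored by Oozie in its database
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + job.getStatus());  // e.g. RUNNING
    }
}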




Question : Assuming a large, properly configured multi-rack Hadoop cluster, which scenario should not result in loss of HDFS data?
1. Ten percent of DataNodes simultaneously fail

2. Multiple racks simultaneously fail

3. Access Mostly Uused Products by 50000+ Subscribers

4. Seventy percent of DataNodes simultaneously fail

5. An entire rack fails


Ans : 5
Exp : By default, Hadoop replicates each block three times. The standard rack placement policy, assuming a rack topology script has been defined, is to place the first copy of
the block on a node in one rack, and the remaining two copies on two nodes in a different rack. Because of this, an entire rack failing will not result in the total loss of any
blocks, since at least one copy will be present in a different rack. Losing even ten percent of the nodes across multiple racks could result in all three nodes which contain a
particular block being lost.


Question : Your developers request that you enable them to use Pig on your Hadoop cluster. What do you need to configure and/or install?

1. Install the Pig interpreter on the client machines only.

2. Install the Pig interpreter on the master node which is running the JobTracker.

3. Access Mostly Uused Products by 50000+ Subscribers
4. Install the Pig interpreter on all nodes in the cluster, and the client machines.

Ans : 1
Exp : Pig is a high-level abstraction on top of MapReduce which allows developers to process data on the cluster using a scripting language rather than having to write Hadoop
MapReduce. The Pig interpreter runs on the client; it processes the user's Pig Latin script and then submits MapReduce jobs to the cluster based on the script. There is no need to
have anything installed on the cluster; all that is required is the interpreter on the client machine.
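To illustrate the client-side interpreter described above, Pig can also be embedded in a Java program; the PigServer below parses the Pig Latin and submits the resulting MapReduce jobs to the cluster. The script and paths are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // Runs on the client machine; only the Hadoop cluster configuration is needed
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("raw_logs = LOAD '/data/weblogs' USING TextLoader AS (line:chararray);");
        pig.registerQuery("errors = FILTER raw_logs BY line MATCHES '.*ERROR.*';");
        // store() triggers planning and submits the MapReduce jobs built from the script
        pig.store("errors", "/data/weblog-errors");
        pig.shutdown();
    }
}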



Question : The Hadoop MapReduce-based computing engine is ideal for use in real-time or near real-time processing.

1. True
2. False

Correct Answer : 2
Explanation: For real-time data processing you should use IBM InfoSphere Streams, not Hadoop MapReduce. Hadoop MapReduce is the best fit for batch-oriented data
processing.



Related Questions


Question : You are working with a social networking company. They have a huge volume of unstructured and semi-structured data such as video files, audio files, images, HTML files and
PDFs. Which of the following technologies will be helpful for storing that type of data so that it remains online in a highly available manner?

1. Spark

2. RDBMS

3. Hadoop

4. IBM Netezza


Question : You are working as a Big Data Solution Architect. You have implemented a Hadoop-based solution in your organization, and it has been working fine for the last few years. Now
the user base is suddenly increasing, and new requirements are coming in for using Flume, Spark Streaming, etc. How would you tackle this situation? Storage is not an issue with the
cluster.

1. Increase number of data nodes in Hadoop Cluster

2. Start using Oozie- workflow for your existing and new jobs

3. Implement Yarn to decouple MapReduce and resource management

4. Integrate an RDBMS-based system to manage new resources in the Hadoop cluster.


Question : Which of the following does NOT involve collaboration among various
organizations to drive innovation and standardization across big data technologies?

1. Cloudera Enterprise

2. Hortonworks Data Platform

3. IBM BigInsights

4. Pivotal Big Data Suite


Question : You are working with a data scientist, who has to provide deep analytics from existing historical data so that new incoming data can
be scored against it. However, you also have to provide a solution for new data that will be received in real time with low latency. What combination of solutions will
work for both of you?

1. SPSS Modeler with InfoSphere Streams

2. Info Sphere DataStage with InfoSphere Streams

3. Hive with MapReduce

4. Pig with YARN

5. BigInsight Platform


Question : Big R utilizes the Big SQL query engine for processing
1. True
2. False


Question : The goal of Big R is to enable the use of R as a query language for big data: Big R hides many of the complexities pertaining to the underlying Hadoop / MapReduce
framework. Using classes such as bigr.frame, bigr.vector and bigr.list, a user is presented with an API that is heavily inspired by R's foundational API on data.frames, vectors
and frames.


1. True
2. False