Question : You need to set up a distributed storage system to process very large data sets, and you want to be able to leverage the Open Data Platform (ODP) Core. Which one of the following would you use? 1. Apache Spark
Correct Answer : GPFS. Explanation: GPFS provides high performance by allowing data to be accessed over multiple computers at once. Most existing file systems are designed for a single-server environment, and adding more file servers does not improve performance. GPFS provides higher input/output performance by striping blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM.
According to (Schmuck and Haskin), a file that is written to the filesystem is broken up into blocks of a configured size, less than 1 megabyte each. These blocks are distributed across multiple filesystem nodes, so that a single file is fully distributed across the disk array. This results in high reading and writing speeds for a single file, as the combined bandwidth of the many physical drives is high. This also makes the filesystem vulnerable to disk failures: any one disk failing would be enough to lose data. To prevent data loss, the filesystem nodes have RAID controllers, so multiple copies of each block are written to the physical disks on the individual nodes. It is also possible to opt out of RAID-replicated blocks, and instead store two copies of each block on different filesystem nodes.
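The striping scheme described above can be sketched in a few lines. This is my own illustration, not GPFS code: blocks of a fixed size are assigned round-robin to disks, so a single file's reads and writes can proceed on all disks in parallel.

```python
# Toy model of GPFS-style striping: split a file into fixed-size blocks
# and distribute them round-robin across several disks. Illustrative
# sketch only; real GPFS manages allocation maps, replication and more.

def stripe(data: bytes, block_size: int, num_disks: int):
    """Split `data` into fixed-size blocks and assign them round-robin."""
    disks = [[] for _ in range(num_disks)]
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        disks[(i // block_size) % num_disks].append(block)
    return disks

def reassemble(disks):
    """Read blocks back in round-robin order to reconstruct the file."""
    blocks = []
    i = 0
    while True:
        disk = disks[i % len(disks)]
        idx = i // len(disks)
        if idx >= len(disk):
            break  # all blocks consumed
        blocks.append(disk[idx])
        i += 1
    return b"".join(blocks)
```

Because consecutive blocks land on different disks, a sequential read of one file keeps every disk busy at once, which is where the bandwidth gain comes from.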
Other features of the filesystem include:
- Distributed metadata, including the directory tree. There is no single "directory controller" or "index server" in charge of the filesystem.
- Efficient indexing of directory entries for very large directories. Many filesystems are limited to a small number of files in a single directory (often 65,536 or a similar small binary number); GPFS does not have such limits.
- Distributed locking. This allows for full POSIX filesystem semantics, including locking for exclusive file access.
- Partition awareness. A failure of the network may partition the filesystem into two or more groups of nodes that can only see the nodes in their own group. This can be detected through a heartbeat protocol, and when a partition occurs, the filesystem remains live for the largest partition formed. This offers graceful degradation of the filesystem: some machines will remain working.
- Online filesystem maintenance. Most of the filesystem maintenance chores (adding new disks, rebalancing data across disks) can be performed while the filesystem is live. This keeps the filesystem available more often, and so keeps the supercomputer cluster itself available for longer.
It is interesting to compare this with Hadoop's HDFS filesystem, which is designed to store similar or greater quantities of data on commodity hardware, that is, datacenters without RAID disks and a Storage Area Network (SAN).
HDFS also breaks files up into blocks, and stores them on different filesystem nodes. HDFS does not expect reliable disks, so instead it stores copies of each block on different nodes. The failure of a node containing a single copy of a block is a minor issue, dealt with by re-replicating the affected blocks from the surviving copies, bringing the replication count back up to the desired number. In contrast, while GPFS supports recovery from a lost node, it is a more serious event, one that may include a higher risk of data being (temporarily) lost. GPFS supports full POSIX filesystem semantics; HDFS and GFS do not. GPFS distributes its directory indices and other metadata across the filesystem. Hadoop, in contrast, keeps this on the Primary and Secondary Namenodes, large servers which must store all index information in RAM. GPFS breaks files up into small blocks. Hadoop HDFS prefers blocks of 64 MB or more, as this reduces the storage requirements of the Namenode. Small blocks or many small files fill up a filesystem's indices fast, and so limit the filesystem's size.
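The block-size trade-off above is easy to quantify. A minimal sketch, using the 64 MB HDFS block size mentioned in the text and a 1 MB GPFS-style block as assumptions, of how many per-block metadata entries a 1 TB file generates:

```python
# Why HDFS prefers large blocks: the Namenode holds one metadata entry
# per block in RAM, so smaller blocks mean far more entries for the same
# amount of data. Figures below are illustrative, not from any manual.

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store a file of the given size."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

TB = 1024 ** 4
MB = 1024 ** 2

# A single 1 TB file:
hdfs_blocks = block_count(1 * TB, 64 * MB)  # 64 MB HDFS-style blocks
gpfs_blocks = block_count(1 * TB, 1 * MB)   # 1 MB GPFS-style blocks

print(hdfs_blocks)  # 16384 metadata entries
print(gpfs_blocks)  # 1048576 metadata entries, 64x the index pressure
```

With small blocks, the index grows 64 times faster for the same data, which is exactly why in-RAM Namenode metadata caps the practical size of an HDFS filesystem full of small blocks or small files.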
Question : In traditional SAN-based storage, to provide high availability for the data, RAID (redundant array of independent disks) was used.
About RAID : RAID is a data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.
Similar protection has been introduced in the Hadoop Distributed File System. What is it known as?
Correct Answer : Replication. Explanation: The Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure high data availability.
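The replication scheme the explanation describes can be sketched as a toy model. This is my own illustration, not HDFS code, and the default replication factor of 3 is an assumption here: each block lives on several distinct nodes, and when a node fails, its blocks are re-replicated from surviving copies.

```python
import random

# Toy model of HDFS-style block replication: each block is placed on
# `replication` distinct nodes; a node failure triggers re-replication
# of the blocks that lost a copy. Illustrative sketch only.

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes."""
    return {b: random.sample(nodes, replication) for b in blocks}

def handle_node_failure(placement, failed, nodes):
    """Re-replicate blocks that lost a copy on the failed node."""
    for block, holders in placement.items():
        if failed in holders:
            holders.remove(failed)
            # Pick a replacement node that does not already hold the block.
            candidates = [n for n in nodes
                          if n != failed and n not in holders]
            holders.append(random.choice(candidates))
    return placement
```

Because every block always has multiple live copies, losing a single node never loses data; the cluster simply copies the affected blocks to healthy nodes, restoring the replication count.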
Question : You are working with a storage company that is helping a market research company which has access to billions of records. This research company is looking for a solution where it can store these billions of records for a temporary period, for example 90 days until its analysis finishes, and also run analytics on that data. Which solution is recommended by IBM? 1. BigInsight
Correct Answer : IBM PureData System for Analytics. Explanation: IBM PureData System for Analytics is a purpose-built, standards-based data warehouse and analytics appliance that integrates database, server, storage and analytics into an easy-to-manage system. It is designed for high-speed analysis of big data volumes, scaling into the petabytes.
Fast analytics : Provides a high-performance, massively parallel system that enables you to gain insight from your data and perform analytics on very large data volumes.
Smart, efficient queries: Simplifies analytics by consolidating all activity in one place, where the data resides.
Simplified infrastructure : Easy to deploy and manage; simplifies your data warehouse and analytic infrastructure. Does not require tuning, indexing or aggregated tables and needs minimal administration.
Advanced security : Enhanced data security is provided through self-encrypting drives as well as support for the Kerberos authentication protocol.
Integrated platform : Supports thousands of users, unifying data warehouse, Hadoop and business intelligence with advanced analytics.