Question : You have a file-based data source to which data is continuously added. You now need to import this data into Hadoop (HDFS). Which of the following tools helps you implement this? 1. Sqoop
Correct Answer : Explanation: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
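As an illustration, the kind of Flume agent this scenario describes could be configured roughly as follows; this is a minimal sketch, and the agent name, spool directory, and HDFS path are placeholder assumptions rather than values from the question. The spooling-directory source picks up new files as they are added, and the HDFS sink writes them into the cluster.

    # agent1.conf -- illustrative Flume agent (all names and paths are placeholders)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = snk1

    # Watch a local directory; files dropped here are ingested as they appear
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/log/incoming
    agent1.sources.src1.channels = ch1

    # Simple in-memory channel between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Deliver the events into HDFS as plain data files
    agent1.sinks.snk1.type = hdfs
    agent1.sinks.snk1.hdfs.path = hdfs://namenode:8020/data/incoming
    agent1.sinks.snk1.hdfs.fileType = DataStream
    agent1.sinks.snk1.channel = ch1

Such an agent would then be started with something like: flume-ng agent --conf conf --conf-file agent1.conf --name agent1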
Question : Which of the following features are supported by IBM GPFS? 1. There is a requirement where logical isolation and physical isolation need to be supported
2. There is a requirement for separate clusters for analytics and databases
4. There is a need to run the NameNode in a separate high availability environment
Correct Answer : Explanation: GPFS supports local disks on cluster nodes and storage area networks (SANs). Logical isolation and physical isolation are both supported: filesets can be separate file systems inside of a file system (logical isolation), or can be placed in separate storage pools (physical isolation). InfoSphere BigInsights uses a customized version of GPFS that supports all existing GPFS commands and provides additional interfaces and commands.

GPFS supports thousands of nodes and petabytes of storage, so you can scale to meet your most demanding needs. Data is replicated on multiple nodes so that no single point of failure exists, whereas the NameNode is a single point of failure in HDFS. You can push updates asynchronously or synchronously, allowing you to choose how you want to manage changes from a primary system to a secondary system. If a node fails, changes are replicated to other nodes. When the failed node is operational again, GPFS quickly determines which blocks must be recovered; changes that occurred while the node was down are copied to the previously failed node so that it is synchronized with the other nodes in the cluster.

Applications define their own logical block size by segmenting data into file blocks, with each file block determined by the effective block size or chunk size. Applications can also determine the replication layout by using wide striping over the network, write affinity on a local disk, or a combination of both. Allowing applications to dictate block size and replication layout provides greater performance and efficiency than HDFS.
Question : You are creating a Hadoop-based solution and need to plan the archival (storage) size of the cluster. Which of the following do you need to consider when deciding the cluster's archival size? 1. Replication factor
Correct Answer : Explanation: If the minimum storage capacity of your cluster is not at least double the size of your data, you will end up with under-replicated blocks. Under-replicated blocks are those replicated fewer times than the replication factor; for example, if your replication factor is 2 and capacity is short, some blocks will end up with a replication factor of 1.
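As a rough illustration of why the replication factor drives cluster sizing, the back-of-the-envelope calculation below multiplies the logical data size by the replication factor and adds headroom for non-DFS use; the numbers and the 25% reserve are assumptions for the example, not figures from the question.

    // Hedged sketch: a simple capacity estimate, not an official Hadoop sizing tool.
    public class ClusterSizing {
        public static void main(String[] args) {
            double dataTb = 100.0;        // logical data to be stored, in TB (example value)
            int replicationFactor = 3;    // HDFS default replication factor
            double nonDfsReserve = 0.25;  // assume ~25% of raw capacity reserved for temp/non-DFS use

            // Raw capacity needed = data size * replication factor / usable fraction
            double rawTb = dataTb * replicationFactor / (1.0 - nonDfsReserve);
            System.out.printf("Raw cluster capacity needed: %.0f TB%n", rawTb); // prints 400 TB
        }
    }

With a replication factor of 3, 100 TB of data already needs 300 TB of raw storage before any working space is set aside.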
Replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, but it also helps run your map tasks close to the data, avoiding extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.
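For completeness, the replication factor of existing files can also be changed through the HDFS FileSystem API; the sketch below is illustrative, the file path is a placeholder, and the default for newly written files comes from dfs.replication in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hedged sketch: request a single replica for one (illustrative) file in HDFS.
    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            boolean requested = fs.setReplication(new Path("/data/archive/file.csv"), (short) 1);
            System.out.println("Replication change requested: " + requested);
            fs.close();
        }
    }

The same change can be made from the command line with hdfs dfs -setrep.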