
IBM Certified Data Architect - Big Data Certification Questions and Answers (Dumps and Practice Questions)



Question : In a Hadoop YARN-based cluster, which of the following needs to be configured for High Availability?
1. JobTracker

2. TaskTracker

3. Access Mostly Uused Products by 50000+ Subscribers

4. DataNode

Correct Answer :
Explanation: The YARN ResourceManager (RM) is responsible for tracking the resources in a cluster and scheduling applications (for example, MapReduce jobs). Before CDH 5, the RM was a single point of failure in a YARN cluster. The RM high availability (HA) feature adds redundancy in the form of an Active/Standby RM pair to remove this single point of failure. Furthermore, upon failover from the Standby RM to the Active RM, applications can resume from their last check-pointed state; for example, completed map tasks in a MapReduce job are not re-run on a subsequent attempt. This allows events such as the following to be handled without any significant performance effect on running applications:
Unplanned events such as machine crashes
Planned maintenance events such as software or hardware upgrades on the machine running the ResourceManager.
RM HA requires ZooKeeper and HDFS services to be running.
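
For reference, the sketch below assembles the yarn-site.xml properties that a typical ResourceManager HA setup defines. It is a minimal illustration, not a complete production configuration: the cluster id, hostnames, and ZooKeeper quorum are placeholders.

# A minimal sketch (placeholder hostnames, cluster id, and ZooKeeper quorum) of
# the yarn-site.xml properties commonly set to enable ResourceManager HA.
import xml.etree.ElementTree as ET

RM_HA_PROPERTIES = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.cluster-id": "yarn-cluster",                # placeholder
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "rm1.example.com",           # placeholder
    "yarn.resourcemanager.hostname.rm2": "rm2.example.com",           # placeholder
    "yarn.resourcemanager.zk-address": "zk1:2181,zk2:2181,zk3:2181",  # ZooKeeper quorum
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
}

def build_yarn_site(props):
    """Assemble a <configuration> document from a name/value mapping."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.ElementTree(root)

if __name__ == "__main__":
    build_yarn_site(RM_HA_PROPERTIES).write(
        "yarn-site.xml", encoding="utf-8", xml_declaration=True)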




Question : A company has to design a new data system. It will need to support several OLTP applications. Every three days, a batch job will run to load specific data into a set of 10 large tables (with historical data) on which OLAP analytics will be performed. Performance for both OLTP and OLAP queries is important. Which of the following designs would you suggest to the company?

1. Use a NoSQL data store such as MongoDB or Cloudant on the cloud to provide needed scalability

2. Use DB2 Data Partition Feature (DPF), partitioning all tables into different partitions

3. Access Mostly Uused Products by 50000+ Subscribers

4. Use DB2 with BLU Acceleration; use columnar store for the 10 tables where analytics will be run

Correct Answer :
Explanation: BLU Acceleration technology combines ease of use with unprecedented storage savings and performance acceleration for analytic workloads. It is built on innovative in-memory, CPU, and I/O optimizations.

The BLU Acceleration feature is intended for analytic or data mart workloads. Such workloads typically involve regular reporting queries as well as ad-hoc business intelligence queries that can't be tuned in advance. If your workload is primarily transaction processing, you may want to consider using row-organized tables. For mixed workloads, shadow tables provide the best of both worlds by maintaining a column-organized copy of a row table. Analytic queries against a row table are transparently routed to the shadow table, thereby leveraging all the advantages of BLU Acceleration without any change to the application. In general, analytic workload characteristics are optimal for column-organized tables, while transactional workload characteristics are well suited for row-organized tables.
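
As a concrete sketch (assuming a DB2 10.5 or later server with BLU Acceleration and the ibm_db Python driver; the table layouts, names, and connection string are invented for illustration), the 10 analytic tables would be created column-organized and the OLTP tables row-organized:

# A hedged sketch: a column-organized (BLU) table for the OLAP side and a
# row-organized table for the OLTP side. All names and credentials are
# placeholders for illustration only.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=SALESDB;HOSTNAME=db2host.example.com;PORT=50000;"
    "PROTOCOL=TCPIP;UID=db2inst1;PWD=secret",
    "", "")

# One of the 10 large historical tables used for OLAP analytics.
ibm_db.exec_immediate(conn, """
    CREATE TABLE sales_history (
        sale_date   DATE,
        store_id    INTEGER,
        product_id  INTEGER,
        amount      DECIMAL(12,2)
    ) ORGANIZE BY COLUMN
""")

# A transactional table that stays row-organized for OLTP access patterns.
ibm_db.exec_immediate(conn, """
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER,
        order_ts    TIMESTAMP
    ) ORGANIZE BY ROW
""")

ibm_db.close(conn)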





Question : Which data format stores all of the data in a binary format, making the files more compact, and will even add in markers to help MapReduce jobs determine where to break large files for more efficient processing?

1. Parquet

2. Avro

3. Access Mostly Uused Products by 50000+ Subscribers

4. Sequence File

5. Map File

Correct Answer :
Explanation: Avro is a storage format for data. It stores the data definition (schema) together with the data, allowing Avro files to be read and interpreted by many different programs. It stores all of the data in a binary format, making the files more compact, and will even add in markers to help MapReduce jobs find where to break large files for more efficient processing.
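
A minimal sketch of this behaviour using the fastavro package (the library and the record layout are assumptions for illustration; the exam item does not name either):

# Write and read an Avro container file: the schema travels with the data,
# and sync markers between blocks let MapReduce split large files.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "store_id", "type": "int"},
        {"name": "product", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [
    {"store_id": 1, "product": "widget", "amount": 9.99},
    {"store_id": 2, "product": "gadget", "amount": 4.50},
]

with open("sales.avro", "wb") as out:
    writer(out, schema, records)

# Any Avro-aware program can read the file back without a separate schema.
with open("sales.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)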



Related Questions


Question : You have been given the scenario below. You have 100 TB of data in total (25 TB + 75 TB, per the breakdown below). Your data scientists have written 1000 queries to access this data.

However,

- Out of 1000 queries, 850 queries access 25 TB of the data
- Out of 1000 queries, only 100 queries touch 25 TB out of the remaining 75 TB of data
- Out of 1000 queries, only 50 queries touch 20 TB out of the remaining 50 TB of data

Hence, you can say 850 queries are frequently accessing 25 TB of the data (a quick calculation of this skew follows the answer choices).

Which of the following would provide the best value (business benefit) and lowest TCO?

1. Place the entire set of data in a data warehouse with proper partitioning and indexing.

2. Place the entire set of data in a Hadoop environment using commodity hardware.

3. Access Mostly Uused Products by 50000+ Subscribers

4. Place the top 25 TB of data (used by 85% of the queries) in a data warehouse, and the rest in a Hadoop environment.
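
The access skew is the crux of this question. A quick back-of-the-envelope check (assuming the 100 TB total implied by the 25 TB + 75 TB split stated above):

# Back-of-the-envelope check of the access skew in the scenario. The 100 TB
# total is inferred from the 25 TB + 75 TB split given above.
TOTAL_QUERIES = 1000
TOTAL_TB = 100

tiers = [        # (queries, data in TB)
    (850, 25),   # hot slice: most queries land here
    (100, 25),
    (50, 20),
]

for queries, tb in tiers:
    print(f"{queries / TOTAL_QUERIES:.0%} of queries touch {tb} TB "
          f"({tb / TOTAL_TB:.0%} of all data)")
# 85% of the queries touch only 25% of the data, which is the usual motivation
# for tiering hot data into a warehouse and colder data onto commodity Hadoop.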


Question : You have a lot of business-critical data, which should be available all the time for applying analytics on it, so you must have a low RPO (recovery point objective). Which of the following will help you do this easily?

1. Clustering

2. Database Shadowing

3. Access Mostly Uused Products by 50000+ Subscribers

4. Tape Backup


Question : You need to provision a Hadoop cluster to perform data analysis on customer sales
data to predict which products are more popular. Which of the following solutions
will let you set up your cluster with the most stability in the platform?

1. Purchase specific products from multiple Independent Software Vendors (ISV) for your requirements in order to take advantage of vendor-specific features

2. Develop your own platform of software components to allow for maximum customization

3. Access Mostly Uused Products by 50000+ Subscribers

4. Leverage the Open Data Platform (ODP) core to provide a stable base against which Big Data solutions providers can qualify solutions



Question : You have been processing a huge volume of customer sales data for Arinika Retail Solutions. However, you also need to process geospatial data, which is produced in the GeoJSON format (a short GeoJSON sample follows the answer choices). Which of the following would provide the desired output in GeoJSON format?
1. Big SQL

2. BigSheets

3. Access Mostly Uused Products by 50000+ Subscribers

4. Text Analytics

5. Apache Pig
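
For reference, the GeoJSON sample mentioned in the question would look roughly like the following (a hypothetical point feature; the coordinates and properties are made up for illustration):

# A hypothetical GeoJSON FeatureCollection: one point feature with coordinates
# given in [longitude, latitude] order, plus arbitrary properties.
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-87.65, 41.85]},
            "properties": {"store_id": 1, "sales": 120345.50},
        }
    ],
}

print(json.dumps(feature_collection, indent=2))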


Question : Which of the following is correct with regard to IBM GPFS? (A sketch of typical mmclone usage follows the answer choices.)
1. File clones can be created from a regular file or a file in a snapshot using the mmclone command.

2. Use the mmclone command to display status for specified files.

3. Access Mostly Uused Products by 50000+ Subscribers

4. A and B

5. A,B and C
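
For context on options 1 and 2 (A and B), the sketch below shows how the GPFS mmclone workflow is typically driven; it is a hedged sketch with placeholder file names rather than a verified procedure.

# A hedged sketch of the GPFS file-clone workflow referenced in the options:
# create a read-only clone parent from a regular file, make a writable clone
# from it, then display clone status. Paths are placeholders.
import subprocess

def run(cmd):
    """Echo and run a GPFS administration command."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a read-only clone parent from a regular file.
run(["mmclone", "snap", "/gpfs/fs1/base.img", "/gpfs/fs1/base.img.clonesrc"])

# Create a writable file clone from the clone parent.
run(["mmclone", "copy", "/gpfs/fs1/base.img.clonesrc", "/gpfs/fs1/vm01.img"])

# Display clone status for the specified files (option 2 / B above).
run(["mmclone", "show", "/gpfs/fs1/base.img.clonesrc", "/gpfs/fs1/vm01.img"])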


Question : You are working with an email marketing company, which already has TB of email data and expects GB of data to be added every day. A typical query can involve pulling in 20 GB of data. Querying this data has always been an issue, hence a proper solution is expected. Which of the following can solve this requirement?
1. Set up a Hadoop system

2. Utilize de-duplication and compression technology

3. Access Mostly Uused Products by 50000+ Subscribers

4. Create range partitions and proper indexing of data and store in DB2