Question-: Which of the following is the correct way to model your data so that the minimum number of partitions is read while querying?
A. Model your data around relationships among the data.
B. Model your data around relationships among the objects.
C. Model your data around the queries you will be using.
D. A and C
E. B and C
Answer: C
Exp: You should minimize partition reads by modeling your data around the queries you will run, so option C is correct. To do that, you first need to identify the queries you are going to execute. For instance, think about the following before modeling your data in a Cassandra database:
- What grouping your query will use.
- In what order you need the data when extracting it.
- What conditions you will use to filter the data.
- What uniqueness the data requires.
Once you have decided the above, design your data model accordingly. You should avoid changing the model later; if you do need to support a different query shape, consider creating materialized views instead.
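As a sketch of query-first modeling (the table and column names here are hypothetical, not taken from the question), suppose the one query you need is "all course purchases for a user, newest first". Each of the checklist items above maps directly onto a part of the CQL schema:

```sql
-- Hypothetical table designed around a single known query:
-- "fetch all course purchases for a user, newest first".
CREATE TABLE purchases_by_user (
    username  text,       -- filter condition  -> partition key
    purchased timestamp,  -- required ordering -> clustering key
    course    text,       -- uniqueness within the partition
    price     decimal,
    PRIMARY KEY ((username), purchased, course)
) WITH CLUSTERING ORDER BY (purchased DESC);

-- The query then reads exactly one partition:
-- SELECT * FROM purchases_by_user WHERE username = 'venkat';
```

Because the partition key matches the filter condition and the clustering order matches the required sort order, the query touches one partition and needs no client-side sorting.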
Question-: You want to store all the subscription details for users subscribed to courses on HadoopExam.com. However, you want to group the users based on the course type. Sample data is shown below.
As you know, the Analytics group has the highest number of users (more than a million), while the BigData group has only a few thousand. Which of the following table designs is suitable so that the data is spread evenly across a 5-node cluster, given that you will always query with CourseGroup as part of the condition and the data should be ordered by username? Here, hash_prefix holds a prefix of a hash of the username: the first byte of the hash modulo four. Also, there are thousands of CourseGroups in the data.
A. CREATE TABLE HE_GROUP ( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup), username) )
B. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup, hash_prefix), username) )
C. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup, email ), username) )
D. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((username, email), username) )
Answer: B
Exp: The question states that data will be queried by coursegroup, so coursegroup must be part of the partition key. The data must also be ordered by username within a partition, so username is the clustering key; every option includes this. Option D is out because coursegroup is not part of its partition key.
The remaining issue is that the data is not balanced across groups: the Analytics group has millions of users while the BigData group has only a few thousand, and we want the data balanced across all the nodes in the cluster. Hence, we add the hash_prefix of the username to the partition key, forming a composite partition key.
Here, hash_prefix holds a prefix of a hash of the username; you can use the first byte of the hash modulo four. Together, coursegroup and hash_prefix make up the composite partition key. Each group's data then resides in multiple partitions (four in this case), and the data is spread evenly across the nodes in the cluster.
However, there is a trade-off: ideally a single query should not have to read across partitions, but with this design a query for one group must now read up to four. It still satisfies the requirements in the question, so you always have to find the right balance for a particular use case. If you do a lot of reads and groups don't get too large, lowering the modulo from four to two may be a good choice. On the other hand, if you do very few reads but any given group can grow very large, raising it from four to ten would be better.
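The hash_prefix itself is computed on the client before writing. A minimal Python sketch of "first byte of a hash of the username, modulo four" is below; the question does not name a hash function, so the use of MD5 here is an assumption (any stable hash works, as long as reads and writes agree on it):

```python
import hashlib

def hash_prefix(username: str, modulo: int = 4) -> int:
    """First byte of a hash of the username, modulo `modulo`.

    MD5 is an assumption for illustration; the question only says
    "a hash of the username".
    """
    first_byte = hashlib.md5(username.encode("utf-8")).digest()[0]
    return first_byte % modulo

# Writes target one of 4 sub-partitions within a coursegroup:
#   INSERT INTO HE_GROUP (coursegroup, hash_prefix, username, ...)
#   VALUES ('Analytics', hash_prefix('venkat'), 'venkat', ...);
#
# Reads for a whole group must fan out over all 4 prefixes, e.g.
#   SELECT * FROM HE_GROUP
#   WHERE coursegroup = 'Analytics' AND hash_prefix IN (0, 1, 2, 3);
# with the per-partition, username-ordered results merged client-side.
```

Changing the modulo (two vs. ten, as discussed above) only changes how many sub-partitions each group is split into, and therefore how many partitions each group-wide read has to touch.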
Question-: You have been given the sample data below.
And your data model is as below. There are thousands of CourseGroups in the data.
CREATE TABLE he_users ( id uuid PRIMARY KEY, username text, email text, location text, first text, last text )
Which of the following statements are correct?
A. This model reduces the duplication of users across many groups.
B. To get all the user info, we need to read all the partitions; for 1000 groups, 1000 partitions will be read.
C. This is a good model for heavy reads.
D. This is a good model for very frequent updates of user info.
Answer: A, B, D
Exp: This data model certainly reduces data duplication, but we must check whether it satisfies our query requirements. As the question states, there are thousands of groups; to extract all the user information, the query must read thousands of partitions, which is very expensive. So if we are optimizing the table for reads, this is not a good model. If, however, we have very few reads and very frequent updates of user info, this model works well. Hence, whenever you design a data model, make sure you take the read/update patterns into account.
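A hedged sketch of the read and update patterns (the membership table below is hypothetical; the question implies grouping but only shows the user table):

```sql
-- The per-user table from the question:
CREATE TABLE he_users (
    id uuid PRIMARY KEY,
    username text, email text, location text, first text, last text
);

-- Hypothetical membership table referencing users by id:
CREATE TABLE he_group_members (
    coursegroup text,
    user_id uuid,
    PRIMARY KEY ((coursegroup), user_id)
);

-- Listing full user info for a group costs one he_users partition
-- read PER member id:
--   SELECT user_id FROM he_group_members WHERE coursegroup = 'Analytics';
--   SELECT * FROM he_users WHERE id = ?;   -- repeated for each id
--
-- An update to a user's email, by contrast, touches exactly one row,
-- no matter how many groups the user belongs to:
--   UPDATE he_users SET email = 'new@example.com' WHERE id = ?;
```

This is why the model suits frequent updates (options A and D) but makes group-wide reads expensive (option B, and why option C is wrong).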