Question-: You are given the sample data below, which contains millions of rows.
While designing the data model, the following requirements must be satisfied:
- We should be able to query the table to return the n newest users in a group.
- Data should be stored evenly across the nodes in the cluster.
- Each new day should get a new partition.
- The Analytics group has a much larger volume of data than any other group.
- The query should look something like this: SELECT * FROM he_group WHERE coursegroup = ? LIMIT ?
A. CREATE TABLE he_group ( coursegroup text, subs_timeuuid timeuuid, subscribed_date text, username text, email text, first text, last text, location text, PRIMARY KEY ((coursegroup, subs_timeuuid), subscribed_date) ) WITH CLUSTERING ORDER BY (subs_timeuuid DESC)
B. CREATE TABLE he_group ( coursegroup text, subs_timeuuid timeuuid, subscribed_date text, username text, email text, first text, last text, location text, PRIMARY KEY ((coursegroup), subscribed_date, subs_timeuuid) ) WITH CLUSTERING ORDER BY (subs_timeuuid DESC)
C. CREATE TABLE he_group ( coursegroup text, subs_timeuuid timeuuid, subscribed_date text, username text, email text, first text, last text, location text, PRIMARY KEY ((coursegroup, subscribed_date), subs_timeuuid) ) WITH CLUSTERING ORDER BY (subs_timeuuid DESC)
D. CREATE TABLE he_group ( coursegroup text, subs_timeuuid timeuuid, subscribed_date text, username text, email text, first text, last text, location text, PRIMARY KEY ((coursegroup, subscribed_date), subs_timeuuid) ) WITH CLUSTERING ORDER BY (subscribed_date DESC)
Answer: C
Exp: The question asks for the n newest users within a group, not across groups, so the natural column to order by is the timeuuid. Because we want the newest users first, the clustering order should be DESC on subs_timeuuid. This rules out option D, which orders by subscribed_date.
Each new day should get a new partition, so subscribed_date must be part of the partition key. This rules out option A, and also option B, where coursegroup alone is the partition key and the large Analytics group would pile up in a single partition. We also query each group individually and want the data sorted, so coursegroup must be part of the partition key as well. Hence, option C satisfies the given requirements.
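The behaviour of a composite partition key on (coursegroup, subscribed_date) with rows clustered newest-first can be sketched with a small in-memory model. This is only an illustration, not a Cassandra client: the class name `HeGroupModel`, the integer timestamps standing in for timeuuid values, and the dates and usernames are all hypothetical. Note that with such a key, a real CQL query generally has to restrict every partition key column, so the lookup below takes the date as well as the group.

```python
from collections import defaultdict


class HeGroupModel:
    """Toy in-memory stand-in for a table keyed like option C (hypothetical)."""

    def __init__(self):
        # (coursegroup, subscribed_date) -> rows of (subs_time, username)
        self.partitions = defaultdict(list)

    def insert(self, coursegroup, subscribed_date, subs_time, username):
        rows = self.partitions[(coursegroup, subscribed_date)]
        rows.append((subs_time, username))
        # Mimics a descending clustering order on the time column: newest first.
        rows.sort(key=lambda row: row[0], reverse=True)

    def newest(self, coursegroup, subscribed_date, n):
        # Touches exactly one partition, then applies the LIMIT.
        rows = self.partitions.get((coursegroup, subscribed_date), [])
        return [username for _, username in rows[:n]]


model = HeGroupModel()
model.insert("Analytics", "2016-01-15", 1, "user_a")
model.insert("Analytics", "2016-01-15", 3, "user_c")
model.insert("Analytics", "2016-01-15", 2, "user_b")
model.insert("Analytics", "2016-01-16", 4, "user_d")  # new day, new partition

newest_two = model.newest("Analytics", "2016-01-15", 2)
```

Here `newest_two` holds the two most recent subscribers for that group and day, and the model ends up with one partition per (group, day) pair, which is what keeps the large Analytics group from growing into a single oversized partition.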
Dev Only
Question-: You are designing a table with the columns (A, B, C, D, E) and you define the key as PRIMARY KEY (A, B, C). Which of the following statements are true?
A. Columns A and B are the partition key
B. Column A is a partition key
C. Columns B and C are a composite clustering key
D. Column C is a clustering key
Answer: B, C
Exp: Let’s review a few basic concepts about primary, partition, and clustering keys.
- A primary key identifies a row uniquely.
- If the primary key is made of more than one column, it is known as a composite key.
- The partition key determines the physical location of the data in the cluster ring.
- A clustering key is the part of the primary key that is not the partition key. It is used for ordering data within each partition.
Let’s look at a few examples of how primary keys are defined.
- Primary Key (A): Column A is the partition key.
- Primary Key (A, B): Column A is the partition key and column B is the clustering key.
- Primary Key ((A,B)): Columns A and B together form a composite partition key.
- Primary Key (A,B,C): Column A is the partition key, while (B,C) is a composite clustering key.
- Primary Key ((A,B),C): Columns A and B form a composite partition key, while column C is the clustering key.
- Primary Key ((A,B), C, D): Columns (A,B) form a composite partition key, while columns (C,D) form a composite clustering key.
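The rules in these examples can be expressed as a small helper. The sketch below is a simplified illustration, not Cassandra's actual parser; the function name `split_primary_key` is hypothetical, and it assumes the input is the text inside the outer parentheses of PRIMARY KEY, with plain column names.

```python
def split_primary_key(definition):
    """Return (partition_columns, clustering_columns).

    `definition` is the text inside PRIMARY KEY's outer parentheses,
    for example "A, B, C" or "(A, B), C, D".
    """
    text = definition.strip()
    if text.startswith("("):
        # Inner parentheses mark a composite partition key.
        close = text.index(")")
        partition = [c.strip() for c in text[1:close].split(",") if c.strip()]
        rest = text[close + 1:]
    else:
        # Without inner parentheses, only the first column is the partition key.
        first, _, rest = text.partition(",")
        partition = [first.strip()]
    # Every remaining column is a clustering column, in declaration order.
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


question_key = split_primary_key("A, B, C")
composite_key = split_primary_key("(A, B), C, D")
```

For the key in the question, `question_key` separates A as the partition key from B and C as the clustering columns, which matches answers B and C.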
Admin/Dev both
Question-: Which of the following should be taken care of while designing a data model in Apache Cassandra?
A. Data should be evenly distributed across the nodes in the cluster.
B. While reading data, we should make sure that as few partitions as possible are read.
C. While reading data, we should make sure that as many partitions as possible are read (try to maximize it).
D. Data duplication is encouraged to avoid reading multiple tables.
Answer: A, B, D
Exp: While designing a data model in Cassandra, we have to take care of the following:
1. Spread data evenly across the cluster. Data is distributed across the cluster based on the hash of the partition key, so we have to choose the partition key in such a way that data is evenly distributed.
2. While reading data, make sure that the minimum number of partitions is accessed, ideally only one. In most cases different partitions reside on different nodes, and the query coordinator node issues a separate command to read data from each partition.
3. Design one table per query pattern. If this results in more than one table, duplicating data across those tables is acceptable.
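The effect of the partition key on data distribution can be illustrated with a rough simulation. This is a sketch under loud assumptions: MD5 with a modulo stands in for the partitioner (Cassandra's default is Murmur3Partitioner, and real clusters assign token ranges rather than taking a modulo), and the node count, group names, and row counts are made up for illustration.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    # Stand-in partitioner: hash the partition key and map it to a node.
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    # Count how many rows land on each node for the given partition keys.
    counts = Counter(node_for(key, node_count) for key in partition_keys)
    return [counts.get(node, 0) for node in range(node_count)]


# Hypothetical skewed data set: 900 Analytics rows and 100 Finance rows,
# spread over 30 days.
rows = [("Analytics", i % 30) for i in range(900)]
rows += [("Finance", i % 30) for i in range(100)]

# Partition key = group only: every Analytics row shares one partition.
by_group = rows_per_node([group for group, _ in rows], 4)

# Partition key = (group, day): the same rows split into many partitions.
by_group_and_day = rows_per_node(rows, 4)
```

With the group alone as the partition key, one node has to hold every Analytics row; adding the day to the key breaks the same rows into many smaller partitions that the hash can spread across nodes.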