Question-: Which of the following is the correct way to model your data so that the minimum number of partitions is read while querying?
A. Model your data around relationships among the data.
B. Model your data around relationships among the objects.
C. Model your data around the queries you will be using.
D. A and C
E. B and C
Answer: C
Exp: You should minimize partition reads by modeling your data around the queries you will run, so option C is correct. To do that, you first need to identify the queries you are going to execute. For instance, think about the following before modeling your data in a Cassandra database:
- What grouping your query will use.
- In what order you need the data when extracting it.
- What conditions you will use to filter the data.
- What uniqueness the data requires.
Once you have decided the above, design your data model accordingly. You should avoid changing the model later; if you do need to support a different query shape, consider creating materialized views instead.
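As a sketch of query-first modeling (the table and column names here are hypothetical, not taken from the question), suppose the one query you need is "all course purchases for a user, newest first". Each of the checklist items above maps directly onto a part of the CQL schema:

```sql
-- Hypothetical table designed around a single known query:
-- "fetch all course purchases for a user, newest first".
CREATE TABLE purchases_by_user (
    username  text,       -- filter condition  -> partition key
    purchased timestamp,  -- required ordering -> clustering key
    course    text,       -- uniqueness within the partition
    price     decimal,
    PRIMARY KEY ((username), purchased, course)
) WITH CLUSTERING ORDER BY (purchased DESC);

-- The query then reads exactly one partition:
-- SELECT * FROM purchases_by_user WHERE username = 'venkat';
```

Because the partition key matches the filter condition and the clustering order matches the required sort order, the query touches one partition and needs no client-side sorting.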
Question-: You want to store all the subscription details for users subscribed to courses on HadoopExam.com. However, you want to group the users based on the course type. Sample data is shown below.
As you know, the Analytics group has the highest number of users (more than a million), while the BigData group has only a few thousand. Which of the following table designs is suitable so that the data is spread evenly across a 5-node cluster, given that you will always query with CourseGroup as part of the condition and the data should be ordered by username? Here, hash_prefix holds a prefix of a hash of the username: the first byte of the hash modulo four. Also, there are thousands of CourseGroups in the data.
A. CREATE TABLE HE_GROUP ( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup), username) )
B. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup, hash_prefix), username) )
C. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((coursegroup, email ), username) )
D. CREATE TABLE HE_GROUP( coursegroup text, username text, email text, first text, last text, location text, hash_prefix int, PRIMARY KEY ((username, email), username) )
Answer: B
Exp: The question states that data will be queried by coursegroup, so coursegroup must be part of the partition key. The data must also be ordered by username within a partition, so username is the clustering key; every option includes this. Option D is out because coursegroup is not part of its partition key.
The remaining issue is that the data is not balanced across groups: the Analytics group has millions of users while the BigData group has only a few thousand, and we want the data balanced across all the nodes in the cluster. Hence, we add the hash_prefix of the username to the partition key, forming a composite partition key.
Here, hash_prefix holds a prefix of a hash of the username; you can use the first byte of the hash modulo four. Together, coursegroup and hash_prefix make up the composite partition key. Each group's data then resides in multiple partitions (four in this case), and the data is spread evenly across the nodes in the cluster.
However, there is a trade-off: ideally a single query should not have to read across partitions, but with this design a query for one group must now read up to four. It still satisfies the requirements in the question, so you always have to find the right balance for a particular use case. If you do a lot of reads and groups don't get too large, lowering the modulo from four to two may be a good choice. On the other hand, if you do very few reads but any given group can grow very large, raising it from four to ten would be better.
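The hash_prefix itself is computed on the client before writing. A minimal Python sketch of "first byte of a hash of the username, modulo four" is below; the question does not name a hash function, so the use of MD5 here is an assumption (any stable hash works, as long as reads and writes agree on it):

```python
import hashlib

def hash_prefix(username: str, modulo: int = 4) -> int:
    """First byte of a hash of the username, modulo `modulo`.

    MD5 is an assumption for illustration; the question only says
    "a hash of the username".
    """
    first_byte = hashlib.md5(username.encode("utf-8")).digest()[0]
    return first_byte % modulo

# Writes target one of 4 sub-partitions within a coursegroup:
#   INSERT INTO HE_GROUP (coursegroup, hash_prefix, username, ...)
#   VALUES ('Analytics', hash_prefix('venkat'), 'venkat', ...);
#
# Reads for a whole group must fan out over all 4 prefixes, e.g.
#   SELECT * FROM HE_GROUP
#   WHERE coursegroup = 'Analytics' AND hash_prefix IN (0, 1, 2, 3);
# with the per-partition, username-ordered results merged client-side.
```

Changing the modulo (two vs. ten, as discussed above) only changes how many sub-partitions each group is split into, and therefore how many partitions each group-wide read has to touch.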
Question-: You have been given the sample data below.
And your data model is as below. There are thousands of CourseGroups in the data.
CREATE TABLE he_users ( id uuid PRIMARY KEY, username text, email text, location text, first text, last text )
Which of the following statements are correct?
A. This model reduces the duplication of users across many groups.
B. To get all the user info, we need to read all the partitions; for 1000 groups, 1000 partitions will be read.
C. This is a good model for heavy reads.
D. This is a good model for very frequent updates of user info.
Answer: A, B, D
Exp: This data model certainly reduces data duplication, but we must check whether it satisfies our query requirements. As the question states, there are thousands of groups; to extract all the user information, the query must read thousands of partitions, which is very expensive. So if we are optimizing the table for reads, this is not a good model. If, however, we have very few reads and very frequent updates of user info, this model works well. Hence, whenever you design a data model, make sure you take the read/update patterns into account.
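A hedged sketch of the read and update patterns (the membership table below is hypothetical; the question implies grouping but only shows the user table):

```sql
-- The per-user table from the question:
CREATE TABLE he_users (
    id uuid PRIMARY KEY,
    username text, email text, location text, first text, last text
);

-- Hypothetical membership table referencing users by id:
CREATE TABLE he_group_members (
    coursegroup text,
    user_id uuid,
    PRIMARY KEY ((coursegroup), user_id)
);

-- Listing full user info for a group costs one he_users partition
-- read PER member id:
--   SELECT user_id FROM he_group_members WHERE coursegroup = 'Analytics';
--   SELECT * FROM he_users WHERE id = ?;   -- repeated for each id
--
-- An update to a user's email, by contrast, touches exactly one row,
-- no matter how many groups the user belongs to:
--   UPDATE he_users SET email = 'new@example.com' WHERE id = ?;
```

This is why the model suits frequent updates (options A and D) but makes group-wide reads expensive (option B, and why option C is wrong).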