Cassandra - Handling partition and bucket for large data size

Question

We have a requirement where an application reads a file and inserts the data into a Cassandra database. However, the table can grow up to 300+ MB in one shot during the day. The table will have the structure below:

create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date)
);

The 'status' column can have the values [Started, Completed, Done]. According to a couple of documents on the internet, READ performance is best when a partition is < 100 MB, and an index should be used on a column that is rarely modified (so I cannot use the 'status' column as an index). Also, if I use buckets with TWCS in minutes, there will be lots of buckets, which may impact performance.

So, how can I make better use of partitions and/or buckets to insert evenly across partitions and read records with the appropriate status?

Thank you in advance.

Well, did you take into account the fact that you cannot update the status column? It is a partition key, so it can never be updated. If you need to change it, you would have to delete the row with the previous status and reinsert it with the new one. To make things worse, you will most probably need to do a read before the write, and depending on how frequently the status changes you may also get a tombstone problem. Creating an index on a partition key would also not make sense even if it were possible: you have a single-column partition key, so querying by it is already the most efficient and recommended approach.
– Mike, Commented Feb 11, 2021 at 7:15
Can you please try to explain how the data will change so we can understand the use case better?
– Mike, Commented Feb 11, 2021 at 7:16
In order to suggest a change to the data model, I need to better understand how you are using it.
– Mike, Commented Feb 11, 2021 at 9:45
Sure @Mike. The application will read files and insert records into the table with status 'Started'. Another application will fetch records where the status is 'Started', process them, and then change the status to either 'Completed' or 'Failure'. A different process needs to read records where 'status' is 'Failure'. I am open to changing the primary key to any other column, say create date, but is there any way to make sure partitions are not overloaded and buckets are also not exhausted?
– Wafa Saba, Commented Feb 11, 2021 at 10:44

Mike · Accepted Answer · 2021-02-11 11:19:26Z

From the discussion in the comments, it looks like you are trying to use Cassandra as a queue, and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look at something like Kafka or RabbitMQ for the queuing.

It could look something like this:

Application 1 copies/generates record A;
Application 1 adds the path of A to a queue;
Application 1 upserts to Cassandra in a partition based on the file id/path (the other columns can hold info such as date, time to copy, file hash, etc.);
Application 2 reads the queue, finds A, processes it and determines whether it failed or completed;
Application 2 upserts to Cassandra information about the processing, including the status. You can also store things like the reason for the failure;
If it is a failure, you can write the path/id to another topic.
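
The steps above can be sketched roughly as follows. This is only an in-memory illustration, not a real integration: `queue.Queue` stands in for a Kafka topic or RabbitMQ queue, a plain dict stands in for a Cassandra table partitioned by file id, and every name here (`ingest`, `process`, `upsert`, `records`, `failed_topic`) is hypothetical.

```python
import queue
import uuid
from datetime import datetime, timezone

# Stand-ins (assumptions): a real setup would use Kafka/RabbitMQ clients
# and a Cassandra driver instead of these in-memory objects.
work_queue = queue.Queue()    # queue that Application 2 consumes
failed_topic = queue.Queue()  # separate topic for failed files
records = {}                  # "table" keyed by file id (the partition key)


def upsert(file_id, **columns):
    """Mimic a Cassandra upsert: write columns into the row for file_id."""
    records.setdefault(file_id, {}).update(columns)


def ingest(path):
    """Application 1: enqueue the file and log it under its own id."""
    file_id = uuid.uuid4()
    work_queue.put((file_id, path))
    upsert(file_id, path=path, create_date=datetime.now(timezone.utc))
    return file_id


def process(handler):
    """Application 2: drain the queue and log the outcome of each file."""
    while not work_queue.empty():
        file_id, path = work_queue.get()
        try:
            handler(path)
            upsert(file_id, status="Completed")
        except Exception as exc:
            # status is a regular column here, so this is a plain overwrite
            upsert(file_id, status="Failure", failure_reason=str(exc))
            failed_topic.put((file_id, path))
```

Because the status is no longer part of the key in this sketch, changing it is an ordinary overwrite rather than a delete plus reinsert, and the "find everything that is Started/Failure" lookups are served by the queues instead of by a Cassandra query.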

So to sum it up, don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including maybe the results of the processing (if applicable), how files were processed, their result and so on.
Depending on how you need to read and use the data in Cassandra later on, you could think about partitions and buckets based on things like the source of the file, the type of file, etc. If not, you could keep it partitioned by a unique value such as the UUID I've seen in your table, and then look up information about a record by that value.
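
If you do end up bucketing, one common approach is to derive the partition key from a coarse time bucket plus a small shard number, so that one day's rows are spread over several partitions of bounded size. A minimal sketch, assuming day-sized buckets and 8 shards per day; both values are illustrative and should be sized from your own data volume, and `partition_key` is a hypothetical helper, not part of any driver:

```python
import uuid
import zlib
from datetime import datetime, timezone

SHARDS_PER_DAY = 8  # assumption: tune so each partition stays well under ~100 MB


def partition_key(record_id: uuid.UUID, create_date: datetime) -> tuple[str, int]:
    """Return (day_bucket, shard) to use as a composite partition key.

    create_date is expected to be timezone-aware.
    """
    day_bucket = create_date.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Stable hash of the id, so the same record always maps to the same shard.
    shard = zlib.crc32(record_id.bytes) % SHARDS_PER_DAY
    return day_bucket, shard
```

If the 300+ MB from the question arrives within a single day and the ids hash evenly, each of the 8 partitions would hold roughly 40 MB. The trade-off is that reading a whole day means querying every shard for that day.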

Hope this helped,
Cheers!
