
This is my first time using Bigtable, and I can't tell whether I don't understand Bigtable modeling or simply don't know how to use the Python library.

Some background on what I'm storing:

I am storing time-series events that, let's say, have two columns, name and message. My row key is "#200501163223", so the row key includes the time in the format '%y%m%d%H%M%S'.
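For illustration, the row-key construction described above could be sketched like this (`make_row_key` is a hypothetical helper, not part of any library; the "account" prefix is taken from the example row key later in the question):

```python
import datetime

# Hypothetical helper (not part of any library) that builds the row key
# described above: a prefix plus the event time in '%y%m%d%H%M%S' format.
def make_row_key(prefix: str, event_time: datetime.datetime) -> str:
    return f"{prefix}#{event_time.strftime('%y%m%d%H%M%S')}"

print(make_row_key("account", datetime.datetime(2020, 5, 1, 16, 32, 23)))
# account#200501163223
```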

Let's say later I needed to add another column called "type".

Also, it is possible for two events to occur in the same second.

So this is what I end up with if I store two events, with the second event having the additional "type" data:


account#200501163223
  Outbox:name                               @ 2020/05/01-17:32:16.412000
    "name1"
  Outbox:name                               @ 2020/05/01-16:41:49.093000
    "name2"
  Outbox:message                            @ 2020/05/01-17:32:16.412000
    "msg1"
  Outbox:message                            @ 2020/05/01-16:41:49.093000
    "msg2"
  Outbox:type                               @ 2020/05/01-16:35:09.839000
    "temp"


When I query this row key using the Python Bigtable library, I get back a dictionary with my column names as keys and the data as a list of Cell objects.

The "name" and "message" keys would each have two Cell objects, and "type" would only have one, since it was only part of the second event.

My question is: how do I know which event, 1 or 2, the "type" value of "temp" belongs to? Is this model just wrong, so that I have to ensure only one event is stored under a row key (which would be hard to do), or is there a trick I'm missing in the library that lets me associate each event's data correctly?

1 Answer


This is a great question, tasha, and something I've come across before too, so thanks for asking it.

In Bigtable, there is no concept of columns being connected just because they were written in the same operation. That flexibility around columns and versions can be very useful, but in your case it causes exactly this issue.

The best way to handle this involves two steps.

  1. Make sure that each time you write to a row you use the same timestamp for every cell in that write. That would look like this:

        import datetime

        # One shared timestamp ties all the cells of this event together.
        timestamp = datetime.datetime.utcnow()

        row_key = "account#200501163223"

        row = table.direct_row(row_key)
        row.set_cell(column_family_id,
                     "name",
                     "name1",
                     timestamp)
        row.set_cell(column_family_id,
                     "type",
                     "temp",
                     timestamp)

        row.commit()
    
  2. Then when you query your database, you can apply a filter to get only the latest version, the latest N versions, or the cells within a timestamp range.

    from google.cloud.bigtable import row_filters

    rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))

Here are a few snippets with examples of how to use filters with Bigtable reads. They should be added to the documentation soon.
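Once every event is written with one shared timestamp, you can reassemble events on the read side by grouping cells by that timestamp. Here's a minimal sketch of that idea; it uses plain `(value, timestamp)` tuples in place of the library's `Cell` objects, and `group_cells_by_timestamp` is a hypothetical helper, not a library function:

```python
from collections import defaultdict

def group_cells_by_timestamp(cells):
    """Group per-column version lists into one dict per event.

    `cells` mirrors the shape of row.cells[column_family_id]:
    a dict mapping column name -> list of (value, timestamp) versions.
    """
    events = defaultdict(dict)
    for column, versions in cells.items():
        for value, ts in versions:
            events[ts][column] = value
    return dict(events)

# Sample data shaped like the two events from the question.
cells = {
    "name": [("name1", "2020-05-01T17:32:16"), ("name2", "2020-05-01T16:41:49")],
    "message": [("msg1", "2020-05-01T17:32:16"), ("msg2", "2020-05-01T16:41:49")],
    "type": [("temp", "2020-05-01T16:41:49")],
}
events = group_cells_by_timestamp(cells)
print(events["2020-05-01T16:41:49"])
# {'name': 'name2', 'message': 'msg2', 'type': 'temp'}
```

With this grouping, it is unambiguous that "temp" belongs to the event written at 16:41:49, because it shares that write's timestamp.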


3 Comments

Thank you! I'm going to test out this approach to see if it solves my issue.
I think I should be able to achieve what I need by grouping data by timestamp (your example helped me ensure the cells of an event share the same timestamp). The link to the filter options was also very helpful!
Happy to help, if you're able to mark this as the accepted answer too that would rock. Thanks!
