
This is my first time using Bigtable, and I can't tell whether I don't understand Bigtable modeling or simply don't know how to use the Python library.

Some background on what I'm storing:

I am storing time-series events that, let's say, have two columns, name and message. My row key is "#200501163223", so the row key includes the time in the format '%y%m%d%H%M%S'.
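For illustration, the row-key construction described above could be sketched like this (`make_row_key` is a hypothetical helper, not part of any library; the "account" prefix is taken from the example row key later in the question):

```python
import datetime

# Hypothetical helper (not part of any library) that builds the row key
# described above: a prefix plus the event time in '%y%m%d%H%M%S' format.
def make_row_key(prefix: str, event_time: datetime.datetime) -> str:
    return f"{prefix}#{event_time.strftime('%y%m%d%H%M%S')}"

print(make_row_key("account", datetime.datetime(2020, 5, 1, 16, 32, 23)))
# account#200501163223
```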

Let's say later I needed to add another column called "type".

Also, it is possible for two events to occur in the same second.

So this is what I end up with if I store two events, with the second event having the additional "type" data:


account#200501163223
  Outbox:name                               @ 2020/05/01-17:32:16.412000
    "name1"
  Outbox:name                               @ 2020/05/01-16:41:49.093000
    "name2"
  Outbox:message                            @ 2020/05/01-17:32:16.412000
    "msg1"
  Outbox:message                            @ 2020/05/01-16:41:49.093000
    "msg2"
  Outbox:type                               @ 2020/05/01-16:35:09.839000
    "temp"


When I query this row key using the Python Bigtable library, I get back a dictionary with my column names as keys and the data as a list of Cell objects.

The "name" and "message" keys would each have two Cell objects, and "type" would only have one, since it was only part of the second event.

My question is: how do I know which event, 1 or 2, the "type" value of "temp" belongs to? Is this model just wrong, so that I have to ensure only one event is stored under a row key (which would be hard to do), or is there a trick I'm missing in the library that lets me associate each event's data correctly?

1 Answer


This is a great question, tasha, and something I've come across before too, so thanks for asking it.

In Bigtable, there is no concept of columns being connected just because they were written in the same operation. That flexibility around columns and versions can be very useful, but in your case it causes exactly this issue.

The best way to handle this involves two steps.

  1. Make sure that each time you write to a row you use the same timestamp for every cell in that write. That would look like this:

        import datetime

        # One shared timestamp ties all the cells of this event together.
        timestamp = datetime.datetime.utcnow()

        row_key = "account#200501163223"

        row = table.direct_row(row_key)
        row.set_cell(column_family_id,
                     "name",
                     "name1",
                     timestamp)
        row.set_cell(column_family_id,
                     "type",
                     "temp",
                     timestamp)

        row.commit()
    
  2. Then when you query your database, you can apply a filter to get only the latest version, the latest N versions, or the cells within a timestamp range.

    from google.cloud.bigtable import row_filters

    rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))

Here are a few snippets with examples of how to use filters with Bigtable reads. They should be added to the documentation soon.
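Once every event is written with one shared timestamp, you can reassemble events on the read side by grouping cells by that timestamp. Here's a minimal sketch of that idea; it uses plain `(value, timestamp)` tuples in place of the library's `Cell` objects, and `group_cells_by_timestamp` is a hypothetical helper, not a library function:

```python
from collections import defaultdict

def group_cells_by_timestamp(cells):
    """Group per-column version lists into one dict per event.

    `cells` mirrors the shape of row.cells[column_family_id]:
    a dict mapping column name -> list of (value, timestamp) versions.
    """
    events = defaultdict(dict)
    for column, versions in cells.items():
        for value, ts in versions:
            events[ts][column] = value
    return dict(events)

# Sample data shaped like the two events from the question.
cells = {
    "name": [("name1", "2020-05-01T17:32:16"), ("name2", "2020-05-01T16:41:49")],
    "message": [("msg1", "2020-05-01T17:32:16"), ("msg2", "2020-05-01T16:41:49")],
    "type": [("temp", "2020-05-01T16:41:49")],
}
events = group_cells_by_timestamp(cells)
print(events["2020-05-01T16:41:49"])
# {'name': 'name2', 'message': 'msg2', 'type': 'temp'}
```

With this grouping, it is unambiguous that "temp" belongs to the event written at 16:41:49, because it shares that write's timestamp.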


3 Comments

Thank you! I'm going to test out this approach to see if it solves my issue.
I think I should be able to achieve what I need by grouping data by timestamp (your example helped me ensure the cells of an event share the same timestamp). The link to the filter options was also very helpful!
Happy to help, if you're able to mark this as the accepted answer too that would rock. Thanks!
