44

I have a little problem that needs some suggestions:

  • Let's say we have a few hundred data tables with a few dozen million rows each.
  • Each data table maps a timestamp (the key) to a value.
  • Each data table is written to once per second.

The latest entry of each table should be quickly obtainable and will most likely be queried the most (something like "follow the data in real time"). Since there is no 'Last()' or similar operation, I was thinking of creating another table, "LatestValues", that holds the latest entry of each data table for faster retrieval. This, however, would add an extra update to every write operation. Also, most of the traffic would be concentrated on this one table (good or bad?). Is there a better solution, or am I missing something?
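To make the trade-off concrete, here is a minimal in-memory sketch of the "LatestValues" idea. Plain dictionaries stand in for the real tables, and the names `data_tables`, `latest_values`, `write_sample` and `read_latest` are hypothetical, not part of any storage API.

```python
from typing import Dict, Tuple

# In-memory stand-ins for the real tables (hypothetical names).
data_tables: Dict[str, Dict[int, float]] = {}        # table -> {timestamp: value}
latest_values: Dict[str, Tuple[int, float]] = {}     # table -> (timestamp, value)


def write_sample(table: str, timestamp: int, value: float) -> None:
    """Store one sample, then refresh the per-table 'latest' pointer."""
    # Write 1: the sample itself.
    data_tables.setdefault(table, {})[timestamp] = value
    # Write 2: the extra update, skipped when the sample arrives out of order.
    current = latest_values.get(table)
    if current is None or timestamp >= current[0]:
        latest_values[table] = (timestamp, value)


def read_latest(table: str) -> Tuple[int, float]:
    """One keyed lookup, regardless of how many rows the data table holds."""
    return latest_values[table]


write_sample("sensor_a", 100, 1.5)
write_sample("sensor_a", 101, 2.5)
write_sample("sensor_a", 99, 9.9)   # late arrival, must not replace the latest
```

Each sample costs up to two writes, and every "follow in real time" read lands on the one small table, which is exactly the concentration of traffic in question.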

Also, let's say we want to query the data tables by value. Since scanning is obviously out of the question, is the only remaining option to create a secondary index by duplicating the data, effectively doubling the storage requirements and the number of write operations? Are there any other solutions?

I'm primarily looking at DynamoDB and Azure Table Storage, but I'm also curious how BigTable handles this.

0

3 Answers

66

I just published an article today with some common "recipes" for DynamoDB. One of them is "Storing article revisions, getting always the latest", which I think might interest you :)

In a nutshell, you can get the latest item using Query(hash_key=..., ScanIndexForward=False, limit=1).

But this assumes you have a range_key defined.

With Scan, there is no parameter such as ScanIndexForward=false, and in any case you cannot rely on the order, because the data is spread over partitions and the Scan request is load balanced across them.

To achieve your goal with DynamoDB, you may "split" your timestamp this way:

  1. hash_key: date
  2. range_key: time or full timestamp, as you prefer

Then, you can use the 'trick' of Query + Limit=1 + ScanIndexForward=false
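As a rough illustration, the request could be assembled like this. It is only a sketch: the table name "sensor_42" and the attribute names "day" (hash key) and "ts" (range key) are made up for the example, and the parameter names are those of the low-level Query API as I understand it today, which may differ from the client library in use when this was written.

```python
def latest_item_query(table_name: str, day: str) -> dict:
    """Build Query parameters that return only the newest item for one day.

    Assumed (hypothetical) schema: hash key "day" such as "2012-10-07",
    range key "ts" holding a timestamp that sorts chronologically.
    """
    return {
        "TableName": table_name,
        # "#d" is a placeholder for the attribute name, in case "day"
        # collides with a DynamoDB reserved word.
        "KeyConditionExpression": "#d = :d",
        "ExpressionAttributeNames": {"#d": "day"},
        "ExpressionAttributeValues": {":d": {"S": day}},
        "ScanIndexForward": False,  # walk the range key in descending order
        "Limit": 1,                 # stop after the first item, i.e. the newest
    }


params = latest_item_query("sensor_42", "2012-10-07")

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   items = boto3.client("dynamodb").query(**params)["Items"]
```

One thing to watch with a per-day hash key: right after midnight the new day's partition may still be empty, so the caller may need to fall back to the previous day.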


7 Comments

Thank you for the answer; your article was an interesting read. I still have one question regarding ScanIndexForward, though. The documentation says: "Specifies ascending or descending traversal of the index..." Does ScanIndexForward work like GROUP BY and just reverse the order of the query results, or does it actually read the range_key in reverse order? In other words, how many reads does this require? My concern is that by the end of the day (assuming 1 write/s) there are over 86k entries, and going through them again and again to get the most recent value would be expensive.
The range_key is indexed, so it's efficient, and with Query you pay only for the retrieved results. That said, I don't know how this is implemented internally.
Just a heads up: in my case, I needed ScanIndexForward=False, not True, to get the latest item. The behaviour may have changed at some point after the article was written. The docs for the query method read: "If ScanIndexForward is true, DynamoDB returns the results in order, by range key. This is the default behavior. If ScanIndexForward is false, DynamoDB sorts the results in descending order by range key, and then returns the results to the client."
As of now, it seems that ScanIndexForward was replaced by BackwardSearch with the same meaning. Unfortunately, I couldn't find any documentation for this change.
Looks like the link you have here has expired. I found what could be the same content on blog.yadutaf.fr/2012/10/07/…
-1

In general, you probably just want to reverse the timestamp, so it decreases over time, leaving the newest row on top.

Here's a blog post of mine outlining how to do this with Windows Azure storage: http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure.
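A minimal sketch of the reversed-key idea, assuming millisecond resolution and a fixed 13-digit ceiling (both are choices made for this example, not something the storage service prescribes). Keys are zero-padded so that string ordering matches numeric ordering, and the transformation is reversible.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WIDTH = 13                      # assumed fixed key width, in digits
MAX_MILLIS = 10**WIDTH - 1      # assumed ceiling for millisecond timestamps


def to_reversed_key(ts: datetime) -> str:
    """Map a timestamp to a key that sorts newest-first as a string."""
    millis = (ts - EPOCH) // timedelta(milliseconds=1)  # exact integer math
    return str(MAX_MILLIS - millis).zfill(WIDTH)


def from_reversed_key(key: str) -> datetime:
    """Recover the original timestamp (to the millisecond) from a key."""
    millis = MAX_MILLIS - int(key)
    return EPOCH + timedelta(milliseconds=millis)


older = datetime(2012, 10, 7, 12, 0, 0, tzinfo=timezone.utc)
newer = datetime(2012, 10, 7, 12, 0, 1, tzinfo=timezone.utc)
keys = sorted([to_reversed_key(older), to_reversed_key(newer)])
# keys[0] now belongs to the newer timestamp, so it would be the first row.
```

Because the smallest key belongs to the newest row, reading the first row of the partition gives the latest value.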

UPDATE

I use DynamoDB for one project, but in a very simplistic way, so I don't have much experience. That said, http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/QueryAndScan.html suggests to me that you can just specify ScanIndexForward=false and Limit=1 to get the last item.

2 Comments

Thank you for your answer. I hadn't considered solving it like that, although I'm a bit hesitant about reformatting the timestamp: it is 'part of the data', so it will be queried and should be in a format that a user can understand. With this solution I'd have to reverse the transformation on every timestamp for every query.
I would suggest storing another column with the timestamp in its normal representation.
-6

For folks who found this thread but only care about 1 table:

You can get the latest item from a table in the UI by clicking on the column to sort by those values.

1 Comment

This only sorts the current page of results (~100 records), presumably to avoid scanning the whole table.
