I'm thinking about syncing data between Postgres and Elasticsearch by using hooks in the ORM. Is this a practical approach, or would it be too expensive?
1 Answer
You must be talking about triggers. The honest answer is that it depends very much on your write volume.
If you are under a constant write load, this will probably be a bad idea. ES performs best when fed large batches of documents at once (its bulk API exists for exactly this) rather than a trickle of individual writes. Generally, you use ES as an index on top of some other database (such as Postgres), and you can live with ES being slightly stale. There used to be an ES technology called rivers to help with this; I see it has now been deprecated.
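To make that concrete, here is a minimal sketch of batch-feeding ES with the official Python client; the `articles` index and document shape are invented for illustration:

```python
# Bulk indexing amortizes HTTP and refresh overhead across many
# documents, which is far cheaper than one request per row.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def index_batch(rows):
    """Index a batch of (id, title, body) tuples in one bulk request."""
    actions = (
        {
            "_index": "articles",
            "_id": row_id,
            "_source": {"title": title, "body": body},
        }
        for row_id, title, body in rows
    )
    bulk(es, actions)
```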
I would say you have several options going forward:
- If your write volume is not large, either write directly from the application or use triggers (there is a sketch of the direct-write option after this list).
- If your write volume is very large, either follow an event-sourcing approach or batch your updates (or do both; this combination is called a "lambda architecture" and is described in detail in the fantastic book I Heart Logs).
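For the low-volume case, "write directly from the application" can be as simple as a dual write: commit to Postgres, then index into ES. A rough sketch with psycopg2 and the Python ES client, with all names illustrative; note the inherent caveat that a failure between the two writes leaves the stores out of sync until something repairs them:

```python
import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=app")
es = Elasticsearch("http://localhost:9200")

def save_article(article_id, title, body):
    # Commit to the source of truth first; the connection context
    # manager commits on success and rolls back on exception.
    with pg:
        with pg.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (id, title, body) VALUES (%s, %s, %s) "
                "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title, "
                "body = EXCLUDED.body",
                (article_id, title, body),
            )
    # If this ES call fails, Postgres and ES diverge until a retry or
    # a periodic batch job repairs the difference.
    es.index(index="articles", id=article_id,
             document={"title": title, "body": body})
```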
The event-sourcing approach is basically to make your application broadcast a stream of events somehow, and then have two processes listening to that stream: one that writes to Postgres and one that writes to ES. This approach is also advocated in I Heart Logs, with Kafka as the event stream. I think you can profitably use many options other than Kafka, such as an AMQP broker.
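As a rough sketch of the shape of this (using kafka-python here, though any durable queue would do; the topic name and payload are invented):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_article(article_id, title, body):
    # The application's only write path: one durable event on the stream.
    producer.send("articles", {"id": article_id, "title": title, "body": body})

def run_es_consumer(es):
    # One of (at least) two consumer processes; its twin writes to Postgres.
    consumer = KafkaConsumer(
        "articles",
        bootstrap_servers="localhost:9092",
        group_id="es-indexer",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        doc = message.value
        es.index(index="articles", id=doc["id"], document=doc)
```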
The batching approach is old school: have a cron job that runs periodically to copy data from your database to ES. This can be a significant performance win over syncing on every write if you get a lot of small changes but your overall database size is not huge. It is a stable architecture and very easy to get right, and it keeps Postgres as the unambiguous source of truth (which matters, since ES is not particularly well-trusted as a long-term data store; see Aphyr's Jepsen posts about it for details).
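A minimal sketch of that cron job, assuming an `updated_at` column and somewhere to persist the high-water mark between runs (both are assumptions on my part, not part of the question):

```python
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def sync_since(pg_dsn, last_sync):
    """Copy every row changed since last_sync into ES in one bulk call."""
    pg = psycopg2.connect(pg_dsn)
    es = Elasticsearch("http://localhost:9200")
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT id, title, body, updated_at FROM articles "
            "WHERE updated_at > %s",
            (last_sync,),
        )
        rows = cur.fetchall()
    bulk(
        es,
        (
            {"_index": "articles", "_id": r[0],
             "_source": {"title": r[1], "body": r[2]}}
            for r in rows
        ),
    )
    # Return the new high-water mark so the next run picks up from here.
    return max((r[3] for r in rows), default=last_sync)
```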
So, triggers. It looks like someone else has tried your approach; it may be worth reaching out to them to see if it ever became production-worthy. Another option I found while searching is ZomboDB, which appears to make Postgres use ES as an indexing service transparently, so the application never talks to ES directly.
Personally, I would not add code to Postgres to write to ES, because I would worry about threading problems and connection failures. Your application is probably better positioned to decide how to handle network failures when talking to ES (which will happen) than a trigger buried deep in the database is. Plus, I wouldn't want to do anything that might destabilize the primary data store. This doesn't mean it's the worst idea, just that I would hesitate to put it into production.

The super-compelling advantage here is getting a fairly strong guarantee that if I wrote something to Postgres, it's in ES, without having to care whether I wrote it from this app or that app. Those are nice properties, but you could easily convince yourself that you're circumventing the CAP theorem when you really aren't; you're just accepting new, broader failure modes in the name of a stronger consistency model than you probably need.
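If you do go down the trigger road, one common compromise (not quite what you asked about, but worth naming) is to keep the trigger free of network calls entirely: it only fires pg_notify, and a listener running in application space does the actual ES write, so ES failures never touch the database. A rough sketch with psycopg2; the channel, table, and `index_article` helper are all hypothetical:

```python
import select
import psycopg2
import psycopg2.extensions

# Run once against the database: the trigger only signals; it never
# performs network I/O itself. (EXECUTE FUNCTION needs Postgres 11+.)
SETUP_SQL = """
CREATE OR REPLACE FUNCTION notify_article_change() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('article_changes', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_changes
    AFTER INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION notify_article_change();
"""

def listen_and_index(es, dsn="dbname=app"):
    """Long-running worker: wait for notifications, index into ES."""
    conn = psycopg2.connect(dsn)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN article_changes;")
    while True:
        # Block until Postgres wakes us up.
        select.select([conn], [], [])
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            # The ES write happens here, in the application, where a
            # failure can be retried without touching the database.
            index_article(es, note.payload)  # hypothetical helper
```

The usual caveat applies: NOTIFY is fire-and-forget, so a listener that is down misses events, which is one more reason to pair this with a periodic batch repair.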
The event-sourcing model has similar advantages, just moved up a level: if I wrote to the broadcast channel, then I can assume it will eventually get loaded by both databases (provided the channel is durable, the messages actually arrive, and so on). But it makes it easier to believe that the two systems will be eventually consistent, which is often a more useful guarantee in a distributed system than "perfect" consistency (and if you have two services, you probably have a distributed system).