I'm thinking about syncing data between Postgres and Elasticsearch by using hooks in the ORM. Is this a practical approach, or would it be too expensive?
1 Answer
You must be talking about triggers. The honest answer is that it depends very much on your write volume.
If you are under a constant write load, this will probably be a bad idea. ES performs best when fed large batches of documents at once (its bulk API exists for exactly this) rather than a trickle of individual writes. Generally, you use ES as an index on top of some other database (such as Postgres), and you can live with ES being slightly stale. There used to be an ES technology called rivers to help with this; I see it has now been deprecated.
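To make that concrete, here is a minimal sketch of batch-feeding ES with the official Python client; the `articles` index and document shape are invented for illustration:

```python
# Bulk indexing amortizes HTTP and refresh overhead across many
# documents, which is far cheaper than one request per row.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def index_batch(rows):
    """Index a batch of (id, title, body) tuples in one bulk request."""
    actions = (
        {
            "_index": "articles",
            "_id": row_id,
            "_source": {"title": title, "body": body},
        }
        for row_id, title, body in rows
    )
    bulk(es, actions)
```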
I would say you have several options going forward:
- If your write volume is not large, either write directly from the application or use triggers (there is a sketch of the direct-write option after this list).
- If your write volume is very large, either follow an event-sourcing approach or batch your updates (or do both; this combination is called a "lambda architecture" and is described in detail in the fantastic book I Heart Logs).
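For the low-volume case, "write directly from the application" can be as simple as a dual write: commit to Postgres, then index into ES. A rough sketch with psycopg2 and the Python ES client, with all names illustrative; note the inherent caveat that a failure between the two writes leaves the stores out of sync until something repairs them:

```python
import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=app")
es = Elasticsearch("http://localhost:9200")

def save_article(article_id, title, body):
    # Commit to the source of truth first; the connection context
    # manager commits on success and rolls back on exception.
    with pg:
        with pg.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (id, title, body) VALUES (%s, %s, %s) "
                "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title, "
                "body = EXCLUDED.body",
                (article_id, title, body),
            )
    # If this ES call fails, Postgres and ES diverge until a retry or
    # a periodic batch job repairs the difference.
    es.index(index="articles", id=article_id,
             document={"title": title, "body": body})
```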
The event-sourcing approach is basically to make your application broadcast a stream of events somehow, and then have two processes listening to that stream: one that writes to Postgres and one that writes to ES. This approach is also advocated in I Heart Logs, with Kafka as the event stream. I think you can profitably use many options other than Kafka, such as an AMQP broker.
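As a rough sketch of the shape of this (using kafka-python here, though any durable queue would do; the topic name and payload are invented):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_article(article_id, title, body):
    # The application's only write path: one durable event on the stream.
    producer.send("articles", {"id": article_id, "title": title, "body": body})

def run_es_consumer(es):
    # One of (at least) two consumer processes; its twin writes to Postgres.
    consumer = KafkaConsumer(
        "articles",
        bootstrap_servers="localhost:9092",
        group_id="es-indexer",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        doc = message.value
        es.index(index="articles", id=doc["id"], document=doc)
```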
The batching approach is old school: have a cron job that runs periodically to copy data from your database to ES. This can be a significant performance win over syncing on every write if you get a lot of small changes but your overall database size is not huge. It is a stable architecture and very easy to get right, and it keeps Postgres as the unambiguous source of truth (which matters, since ES is not particularly well-trusted as a long-term data store; see Aphyr's Jepsen posts about it for details).
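A minimal sketch of that cron job, assuming an `updated_at` column and somewhere to persist the high-water mark between runs (both are assumptions on my part, not part of the question):

```python
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def sync_since(pg_dsn, last_sync):
    """Copy every row changed since last_sync into ES in one bulk call."""
    pg = psycopg2.connect(pg_dsn)
    es = Elasticsearch("http://localhost:9200")
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT id, title, body, updated_at FROM articles "
            "WHERE updated_at > %s",
            (last_sync,),
        )
        rows = cur.fetchall()
    bulk(
        es,
        (
            {"_index": "articles", "_id": r[0],
             "_source": {"title": r[1], "body": r[2]}}
            for r in rows
        ),
    )
    # Return the new high-water mark so the next run picks up from here.
    return max((r[3] for r in rows), default=last_sync)
```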
So, triggers. It looks like someone else has tried your approach; it may be worth reaching out to them to see if it ever became production-worthy. Another option I found while searching is ZomboDB, which appears to make Postgres use ES as an indexing service transparently, so the application never talks to ES directly.
Personally, I would not add code to Postgres to write to ES, because I would worry about threading problems and connection failures. Your application is probably better positioned to decide how to handle network failures when talking to ES (which will happen) than a trigger buried deep in the database is. Plus, I wouldn't want to do anything that might destabilize the primary data store. This doesn't mean it's the worst idea, just that I would hesitate to put it into production.

The super-compelling advantage here is getting a fairly strong guarantee that if I wrote something to Postgres, it's in ES, without having to care whether I wrote it from this app or that app. Those are nice properties, but you could easily convince yourself that you're circumventing the CAP theorem when you really aren't; you're just accepting new, broader failure modes in the name of a stronger consistency model than you probably need.
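If you do go down the trigger road, one common compromise (not quite what you asked about, but worth naming) is to keep the trigger free of network calls entirely: it only fires pg_notify, and a listener running in application space does the actual ES write, so ES failures never touch the database. A rough sketch with psycopg2; the channel, table, and `index_article` helper are all hypothetical:

```python
import select
import psycopg2
import psycopg2.extensions

# Run once against the database: the trigger only signals; it never
# performs network I/O itself. (EXECUTE FUNCTION needs Postgres 11+.)
SETUP_SQL = """
CREATE OR REPLACE FUNCTION notify_article_change() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('article_changes', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_changes
    AFTER INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION notify_article_change();
"""

def listen_and_index(es, dsn="dbname=app"):
    """Long-running worker: wait for notifications, index into ES."""
    conn = psycopg2.connect(dsn)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN article_changes;")
    while True:
        # Block until Postgres wakes us up.
        select.select([conn], [], [])
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            # The ES write happens here, in the application, where a
            # failure can be retried without touching the database.
            index_article(es, note.payload)  # hypothetical helper
```

The usual caveat applies: NOTIFY is fire-and-forget, so a listener that is down misses events, which is one more reason to pair this with a periodic batch repair.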
The event-sourcing model has similar advantages, just moved up a level: if I wrote to the broadcast channel, then I can assume it will eventually get loaded by both databases (provided the channel is durable, the messages actually arrive, and so on). But it makes it easier to believe that the two systems will be eventually consistent, which is often a more useful guarantee in a distributed system than "perfect" consistency (and if you have two services, you probably have a distributed system).