I want to import some tables from a Postgres database into Elasticsearch and also keep those tables in sync with the data in Elasticsearch. I have looked at a Udemy course and talked with a colleague who has a lot of experience with this issue to find out the best way to do it. I was surprised to hear from both of them that the best way seems to be writing code in Python, Java, or some other language that handles the import and the sync, which brings me to my question: is this actually the best way to handle this situation? It seems like there would be a library, plugin, or something else that handles importing data into Elasticsearch and keeping it in sync with an external database. What is the best way to handle this situation?
-
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. – Mickael, Mar 22, 2018 at 11:07
-
Sorry, I know, and I'm selfish, so I put this question up. – slipperypete, Mar 22, 2018 at 14:04
-
I downvoted because this question doesn't show any research effort (idownvotedbecau.se/noresearch). I searched Google for "elasticsearch postgresql" and found in the top results that an Elasticsearch JDBC importer existed. Then I searched "elasticsearch jdbc postgresql" and it gave me the link alkis indicated in his answer. It took me 1 minute... – Mickael, Mar 22, 2018 at 14:55
3 Answers
It depends on your use case. A common practice is to handle this at the application layer: basically, you replicate the actions of one database to the other. So, for example, if you save an entry in Postgres, you do the same in Elasticsearch.
If you do this, however, you'll need a queuing system in place. The queue can be integrated into your application layer, so that if the save to Elasticsearch fails, you can replay the operation. On top of that, your queuing system should implement a throttling mechanism so you don't overwhelm Elasticsearch. Another approach is to send events to another app (e.g. Logstash), so that throttling and persistence are handled by that system rather than your application. A minimal sketch of the application-layer approach follows.
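For concreteness, here is a minimal Python sketch of that dual-write-plus-queue idea. It assumes psycopg2 and the official elasticsearch-py client (7.15+, which accepts `document=`); the `products` table/index, the connection details, and the in-process `queue.Queue` (a stand-in for a durable broker like RabbitMQ or Kafka) are all invented for illustration:

```python
import queue

import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=shop user=app")  # hypothetical DSN
es = Elasticsearch("http://localhost:9200")
retry_queue = queue.Queue()  # stand-in for a durable broker

def save_product(product_id, name, price):
    # 1. Write to Postgres first; it remains the source of truth.
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO products (id, name, price) VALUES (%s, %s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name, "
            "price = EXCLUDED.price",
            (product_id, name, price),
        )
    # 2. Replicate the same write to Elasticsearch.
    doc = {"name": name, "price": price}
    try:
        es.index(index="products", id=product_id, document=doc)
    except Exception:
        # 3. On failure, enqueue the operation for replay instead of
        #    losing it; a background worker drains this queue with
        #    backoff, which is also where the throttling lives.
        retry_queue.put(("products", product_id, doc))
```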
Another approach is the one described in https://www.elastic.co/blog/logstash-jdbc-input-plugin: a separate system "polls" your database and sends the changes to Elasticsearch. In this case Logstash is ideal, since it's part of the ELK stack and has great integration. See also https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html.
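To give a rough idea, here is an (untested) pipeline config for the JDBC input, using its incremental-polling support via `:sql_last_value`. The connection string, driver path, and the `products`/`updated_at` names are placeholders; note that polling like this picks up inserts and updates but not deletes:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/shop"
    jdbc_user => "app"
    jdbc_driver_library => "/path/to/postgresql-jdbc.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"  # poll once a minute
    statement => "SELECT * FROM products WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    document_id => "%{id}"  # reuse the table's primary key
  }
}
```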
Another approach is to use the NOTIFY mechanism of Postgres to send events to some queue, which then handles saving the changes in Elasticsearch.
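Here is a hedged sketch of that pattern in Python, using psycopg2's LISTEN support. It assumes you have already created a trigger on the table that calls `pg_notify('es_sync', NEW.id::text)` after each insert/update; the channel, table, and column names are invented for the example, and deletes would need their own handling:

```python
import select

import psycopg2
import psycopg2.extensions
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=shop user=app")  # hypothetical DSN
pg.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
es = Elasticsearch("http://localhost:9200")

with pg.cursor() as cur:
    cur.execute("LISTEN es_sync;")

while True:
    # Wait until Postgres has a notification for us (5 s timeout).
    if select.select([pg], [], [], 5) == ([], [], []):
        continue
    pg.poll()
    while pg.notifies:
        note = pg.notifies.pop(0)
        row_id = note.payload
        # Re-read the row and index its current version.
        with pg.cursor() as cur:
            cur.execute("SELECT name, price FROM products WHERE id = %s",
                        (row_id,))
            row = cur.fetchone()
        if row:
            es.index(index="products", id=row_id,
                     document={"name": row[0], "price": row[1]})
```

In practice you'd put a real queue between this listener and Elasticsearch, as described above, so notifications aren't lost while the indexer is down.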
There is a more recent tool called "abc", developed by appbase.io. Its performance is in a different league from Logstash's:
- abc is written in Go
- Logstash runs on JRuby

Anybody who's ever used Logstash knows that it takes at least 20 seconds just to start. The same basic table-import task from PostgreSQL to Elasticsearch takes ~1 min with Logstash and 5 seconds with abc.
Pros:
- Performance
- Performance
- Simplicity (no config to write)

Cons:
- Better suited to one-shot imports; the daemon mode is limited
- Fewer middleware options than Logstash's filters: you have to write a transform.js file that changes events by hand
As with anything in life, "best" is subjective. Your colleague likes to write and maintain code to keep this in sync; there's nothing wrong with that.
I would say the best way is to use a data pipeline. There's a plethora of choices, really overwhelming; you can explore the various solutions that support Postgres and Elasticsearch. Here are the options I'm familiar with.
Note that these are tools/platforms for your solution, not the solution itself. YOU have to configure, customize, and enhance them to fit your definition of "in sync".