Keeping elasticsearch and database in sync

Question

I am trying to figure out a way to keep my MySQL database and Elasticsearch index in sync. I have set up a JDBC river using the jprante/elasticsearch-river-jdbc plugin for Elasticsearch. When I execute the request below:

curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://localhost:3306/MY-DATABASE",
    "user" : "root",
    "password" : "password",
    "sql" : "select * from users",
    "poll" : "1m"
},
"index" : {
    "index" : "test_index",
    "type" : "user"
}
}'

the river starts indexing data, but for some records I get org.elasticsearch.index.mapper.MapperParsingException. There is a discussion related to this issue here, but I want to know how to get around it.

Is it possible to fix this permanently by creating an explicit mapping for all fields of the type that I am trying to index, or is there a better way to solve this issue?

My other question: when the JDBC river polls the database again, it seems to re-index the entire data set (as given by the SQL query) into ES. I am not sure, but is this done because Elasticsearch wants to add fresh data as well as pick up any changes to the existing data? Is it possible to index only the fresh data if the table's existing data is static?

possible duplicate of Ensuring ElasticSearch is in Sync with Database
– mahemoff, Commented Mar 29, 2014 at 18:09

dadoonet · Accepted Answer · 2012-10-04 06:11:40Z


Did you look at the default mapping? http://www.elasticsearch.org/guide/reference/mapping/dynamic-mapping.html

I think it can help you here.
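For the explicit-mapping route the question asks about, here is a rough sketch that builds a mapping body for the `user` type in `test_index`. The column names (`id`, `name`, `created_at`) are made up for illustration, and the `string` type and the per-type mapping layout assume an Elasticsearch version from the river era; both changed in later releases, so check against the version you run.

```python
import json


def build_user_mapping(columns):
    """Return an index-creation body with an explicit mapping for type 'user'.

    `columns` maps a column name to an Elasticsearch field type.
    """
    properties = {name: {"type": es_type} for name, es_type in columns.items()}
    return {"mappings": {"user": {"properties": properties}}}


# Hypothetical columns of the `users` table (assumption, not from the question).
body = build_user_mapping({"id": "long", "name": "string", "created_at": "date"})
print(json.dumps(body, indent=2))
```

The idea is to send this JSON when creating the index, before the river starts, so field types are fixed up front instead of being guessed from whichever row arrives first.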

If you have an insertion date column in your table, you can use it to filter what you have to index, so that each poll only selects rows added since the previous one. See https://github.com/jprante/elasticsearch-river-jdbc#time-based-selecting
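To show the general idea of selecting by insertion date (this is not the river plugin's own syntax, see the link for that), here is a minimal sketch that keeps a "last seen" timestamp between polls. It uses an in-memory SQLite table as a stand-in for the MySQL `users` table, and the `created_at` column and sample rows are assumptions.

```python
import sqlite3


def fetch_new_rows(conn, last_seen):
    """Return rows inserted strictly after `last_seen`, oldest first."""
    cur = conn.execute(
        "SELECT id, name, created_at FROM users "
        "WHERE created_at > ? ORDER BY created_at",
        (last_seen,),
    )
    return cur.fetchall()


# In-memory stand-in for the MySQL `users` table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "alice", "2012-10-01 09:00:00"),
        (2, "bob", "2012-10-03 12:30:00"),
        (3, "carol", "2012-10-04 06:00:00"),
    ],
)

last_seen = "2012-10-02 00:00:00"  # watermark saved after the previous poll
rows = fetch_new_rows(conn, last_seen)
if rows:
    last_seen = rows[-1][2]  # advance the watermark to the newest row fetched
print(rows)
```

Only the rows newer than the saved timestamp get sent for indexing. Note that this picks up inserts only; catching updates would need a last-modified column instead.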

HTH

David


tuomastik · Accepted Answer · 2017-05-31 09:00:57Z

Elasticsearch has dropped the river concept entirely. It is not a recommended path, because it usually doesn't make sense to keep the same normalized SQL table structure in a document store like Elasticsearch.

Say you have Product as an entity with some attributes, and Reviews on the Product entity kept in a parent-child table, since one product can have multiple reviews.

Products(Id, name, status,... etc)
Product_reviews(product_id, review_id)
Reviews(id, note, rating,... etc)

In a document store you may want to create a single index, named say product, whose documents look like Product{attribute1, attribute2, ..., reviews[review1, review2, ...]}
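As a small sketch of that denormalization, the function below folds rows from the three tables into one nested product document. The row dictionaries and sample values are hypothetical; only the table and column names come from the schema above.

```python
def build_product_document(product, product_reviews, reviews):
    """Fold a product row and its linked review rows into one nested document."""
    reviews_by_id = {r["id"]: r for r in reviews}
    linked = [
        reviews_by_id[link["review_id"]]
        for link in product_reviews
        if link["product_id"] == product["id"]
        and link["review_id"] in reviews_by_id
    ]
    doc = dict(product)  # copy so the source row is left untouched
    doc["reviews"] = [{"note": r["note"], "rating": r["rating"]} for r in linked]
    return doc


# Hypothetical rows, shaped like the three tables.
product = {"id": 1, "name": "kettle", "status": "active"}
product_reviews = [
    {"product_id": 1, "review_id": 10},
    {"product_id": 1, "review_id": 11},
    {"product_id": 2, "review_id": 12},
]
reviews = [
    {"id": 10, "note": "boils fast", "rating": 5},
    {"id": 11, "note": "noisy", "rating": 3},
    {"id": 12, "note": "unrelated product", "rating": 1},
]

doc = build_product_document(product, product_reviews, reviews)
print(doc)
```

The resulting document is what you would index, so a search on the product index can match on review text without any join.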

Here is an approach to syncing in such a setup.

Assumption:

SQL database (the true source of record)
Elasticsearch or any other NoSQL document store

Solution:

As soon as an update (or a batch of updates) happens, publish an event (or events) to JMS/AMQP/a database queue/a file system queue/Amazon SQS etc., carrying either the full Product or only the primary object ID (I would recommend just the ID).
The queue consumer should then call the web service to get the full object if only the primary ID was pushed to the queue, or otherwise take the object itself from the message, and send the respective changes to Elasticsearch/the NoSQL database.
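The two steps above can be sketched roughly as follows. Everything here is a stand-in: `queue.Queue` plays the role of JMS/AMQP/SQS, the `database` dict plays the SQL source of record (and the web service in front of it), and the `search_index` dict plays Elasticsearch. Handling of deleted rows is my own assumption, not part of the steps above.

```python
import queue

# Stand-ins (assumptions): a dict for the SQL source of record,
# a dict for the search index, and an in-process queue for the broker.
database = {42: {"id": 42, "name": "kettle", "status": "active"}}
search_index = {}
events = queue.Queue()


def publish_change(product_id):
    """Step 1: after a write to the database, publish only the primary ID."""
    events.put(product_id)


def load_product(product_id):
    """Stand-in for the web service that returns the full, current object."""
    return database.get(product_id)


def consume_pending():
    """Step 2: drain the queue and push the current state to the index."""
    while True:
        try:
            product_id = events.get_nowait()
        except queue.Empty:
            break
        product = load_product(product_id)
        if product is None:
            # Assumption: a missing row means it was deleted at the source.
            search_index.pop(product_id, None)
        else:
            search_index[product_id] = dict(product)
        events.task_done()


database[42]["status"] = "discontinued"  # an update at the source of record
publish_change(42)
consume_pending()
print(search_index[42])
```

Publishing only the ID means the consumer always reads the latest state from the source of record, so duplicate or out-of-order events do no harm; the cost is one extra lookup per event.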
