
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the results to Elasticsearch.

The tables are joined by an external process, since the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
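
Roughly, the join looks like the sketch below. This is a simplified illustration only: the Row type and field names are placeholders for the real table records, and Jackson's ObjectNode stands in for the JsonDocument.

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.*;

public class JoinSketch {
    // Hypothetical row read from one of the source tables.
    record Row(String key, String sourceTable, String payload) {}

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Group rows from all tables by key (the "multimap"), then build one
    // denormalized JSON document per key.
    static Map<String, String> joinToJson(List<Row> rowsFromAllTables) throws JsonProcessingException {
        Map<String, List<Row>> byKey = new HashMap<>();
        for (Row r : rowsFromAllTables) {
            byKey.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }

        Map<String, String> docsByKey = new HashMap<>();
        for (Map.Entry<String, List<Row>> e : byKey.entrySet()) {
            ObjectNode doc = MAPPER.createObjectNode();
            doc.put("key", e.getKey());
            ArrayNode parts = doc.putArray("parts");
            for (Row r : e.getValue()) {
                ObjectNode part = parts.addObject();
                part.put("source", r.sourceTable());
                part.put("payload", r.payload());
            }
            docsByKey.put(e.getKey(), MAPPER.writeValueAsString(doc));
        }
        return docsByKey;
    }
}
```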

Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.

I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
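
The kind of split I have in mind looks roughly like this (purely illustrative; the joined documents are represented here as key -> JSON string, and the slice count is arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PartitionSketch {
    // Split the joined documents (key -> JSON string) into `parts` roughly equal
    // slices so each slice can be bulk-indexed by its own worker.
    static List<List<String>> partition(Map<String, String> docsByKey, int parts) {
        List<List<String>> slices = new ArrayList<>(parts);
        for (int i = 0; i < parts; i++) {
            slices.add(new ArrayList<>());
        }
        int i = 0;
        for (String doc : docsByKey.values()) {
            slices.get(i++ % parts).add(doc); // round-robin documents over the slices
        }
        return slices;
    }
}
```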

Anyway, I'm currently building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something similar?

1 Answer


While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum for the bulk size. Here I follow these steps:

1) Build a streaming bulk insert: stream, but still send more than one record at a time (or, in your case, build more than one JSON document at a time). A sketch follows below.

2) Experiment with several bulk sizes, for example 10, 100, 1,000 and 10,000, and plot the results in a quick graph. Run a sufficient number of records to check that performance does not degrade over time: 10 may be extremely fast per record, yet carry an incremental insert overhead (as is the case in SQL Server with primary-key maintenance, for example). If you run the same total number of records for every test, the results should be representative of your performance.

3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.

Then use the final result as your optimal stream bulk insertion size.
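
As a concrete starting point for step 1, a streaming bulk insert with a configurable batch size could look like the sketch below. This assumes the Elasticsearch High Level REST Client (7.x); the index name, document IDs and batchSize value are placeholders that you would tune with the experiment above.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;
import java.util.Map;

public class BulkSketch {
    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            Map<String, String> docsByKey = Map.of();   // the joined JSON documents
            int batchSize = 1000;                       // the value you are tuning

            BulkRequest bulk = new BulkRequest();
            for (Map.Entry<String, String> e : docsByKey.entrySet()) {
                bulk.add(new IndexRequest("my-index")
                        .id(e.getKey())
                        .source(e.getValue(), XContentType.JSON));
                if (bulk.numberOfActions() >= batchSize) {
                    flush(client, bulk);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) {
                flush(client, bulk);
            }
        }
    }

    private static void flush(RestHighLevelClient client, BulkRequest bulk) throws IOException {
        BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
        // Log the server-side "took" time per batch: these are the data points
        // you plot against the batch size.
        System.out.printf("indexed %d docs, took %s, failures=%b%n",
                bulk.numberOfActions(), response.getTook(), response.hasFailures());
    }
}
```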

Once you have this value, you can add one more step: run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk sizes one more time.
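
A minimal sketch of that parallel step, assuming a fixed thread pool and a bulkIndex callback that wraps a bulk insert like the one above; the worker count and time-out are arbitrary placeholders:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelSketch {
    // Run one bulk-indexing worker per slice, each using the bulk size found
    // in the experiment above, and wait for all of them to finish.
    static void runInParallel(List<List<String>> slices,
                              Consumer<List<String>> bulkIndex) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        for (List<String> slice : slices) {
            pool.submit(() -> bulkIndex.accept(slice));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // placeholder time-out
    }
}
```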

This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.


6 Comments

Yeah, I think I will just run multiple instances of my app :P So each instance can read different ranges from the DB tables.
That works. Depending on how often this process is used, collecting some performance data can be very useful.
Well, I know for 3,000 documents the times are as follows: the took time reported by the ES bulk response is 200-300 ms and network transfer latency is about 400-500 ms, so the total time is usually just under 1 second. I'm not questioning the performance so much as wondering whether there are any existing tools. So far the custom route is working; I just have to build the application and make it robust, etc. Basically I'm building a multithreaded ingestion pipeline type of thing. It works, but I was hoping there was an off-the-shelf option.
Sadly, there are no off-the-shelf tools as far as I know.
I have pushed it to 8,000 by running the same app multiple times. Now I just have to ensure each app instance is reading different ranges from the DB and I'm good :)