
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the results to Elasticsearch.

The tables are joined by an external process, since the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
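
Roughly, the join looks like the sketch below. This is a simplified illustration only: the Row type and field names are placeholders for the real table records, and Jackson's ObjectNode stands in for the JsonDocument.

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.*;

public class JoinSketch {
    // Hypothetical row read from one of the source tables.
    record Row(String key, String sourceTable, String payload) {}

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Group rows from all tables by key (the "multimap"), then build one
    // denormalized JSON document per key.
    static Map<String, String> joinToJson(List<Row> rowsFromAllTables) throws JsonProcessingException {
        Map<String, List<Row>> byKey = new HashMap<>();
        for (Row r : rowsFromAllTables) {
            byKey.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }

        Map<String, String> docsByKey = new HashMap<>();
        for (Map.Entry<String, List<Row>> e : byKey.entrySet()) {
            ObjectNode doc = MAPPER.createObjectNode();
            doc.put("key", e.getKey());
            ArrayNode parts = doc.putArray("parts");
            for (Row r : e.getValue()) {
                ObjectNode part = parts.addObject();
                part.put("source", r.sourceTable());
                part.put("payload", r.payload());
            }
            docsByKey.put(e.getKey(), MAPPER.writeValueAsString(doc));
        }
        return docsByKey;
    }
}
```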

Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.

I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
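
The kind of split I have in mind looks roughly like this (purely illustrative; the joined documents are represented here as key -> JSON string, and the slice count is arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PartitionSketch {
    // Split the joined documents (key -> JSON string) into `parts` roughly equal
    // slices so each slice can be bulk-indexed by its own worker.
    static List<List<String>> partition(Map<String, String> docsByKey, int parts) {
        List<List<String>> slices = new ArrayList<>(parts);
        for (int i = 0; i < parts; i++) {
            slices.add(new ArrayList<>());
        }
        int i = 0;
        for (String doc : docsByKey.values()) {
            slices.get(i++ % parts).add(doc); // round-robin documents over the slices
        }
        return slices;
    }
}
```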

Anyway, I'm currently building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something similar?

1 Answer


While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum for the bulk size. Here I follow these steps:

1) Build a streaming bulk insert: stream, but still send more than one record at a time (or, in your case, build more than one JSON document at a time). A sketch follows below.

2) Experiment with several bulk sizes, for example 10, 100, 1,000 and 10,000, and plot the results in a quick graph. Run a sufficient number of records to check that performance does not degrade over time: 10 may be extremely fast per record, yet carry an incremental insert overhead (as is the case in SQL Server with primary-key maintenance, for example). If you run the same total number of records for every test, the results should be representative of your performance.

3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.

Then use the final result as your optimal stream bulk insertion size.
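
As a concrete starting point for step 1, a streaming bulk insert with a configurable batch size could look like the sketch below. This assumes the Elasticsearch High Level REST Client (7.x); the index name, document IDs and batchSize value are placeholders that you would tune with the experiment above.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;
import java.util.Map;

public class BulkSketch {
    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            Map<String, String> docsByKey = Map.of();   // the joined JSON documents
            int batchSize = 1000;                       // the value you are tuning

            BulkRequest bulk = new BulkRequest();
            for (Map.Entry<String, String> e : docsByKey.entrySet()) {
                bulk.add(new IndexRequest("my-index")
                        .id(e.getKey())
                        .source(e.getValue(), XContentType.JSON));
                if (bulk.numberOfActions() >= batchSize) {
                    flush(client, bulk);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) {
                flush(client, bulk);
            }
        }
    }

    private static void flush(RestHighLevelClient client, BulkRequest bulk) throws IOException {
        BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
        // Log the server-side "took" time per batch: these are the data points
        // you plot against the batch size.
        System.out.printf("indexed %d docs, took %s, failures=%b%n",
                bulk.numberOfActions(), response.getTook(), response.hasFailures());
    }
}
```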

Once you have this value, you can add one more step: run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk sizes one more time.
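
A minimal sketch of that parallel step, assuming a fixed thread pool and a bulkIndex callback that wraps a bulk insert like the one above; the worker count and time-out are arbitrary placeholders:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelSketch {
    // Run one bulk-indexing worker per slice, each using the bulk size found
    // in the experiment above, and wait for all of them to finish.
    static void runInParallel(List<List<String>> slices,
                              Consumer<List<String>> bulkIndex) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        for (List<String> slice : slices) {
            pool.submit(() -> bulkIndex.accept(slice));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // placeholder time-out
    }
}
```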

This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.


6 Comments

Yeah, I think I will just run multiple instances of my app :P So each instance can read different ranges from the DB tables.
That works. Depending on how often this process is used, collecting some performance data can be very useful.
Well, I know for 3,000 documents the times are as follows: the took time reported by the ES bulk response is 200-300 ms and network transfer latency is about 400-500 ms, so the total time is usually just under 1 second. I'm not questioning the performance so much as wondering whether there are any existing tools. So far the custom route is working; I just have to build the application and make it robust, etc. Basically I'm building a multithreaded ingestion pipeline type of thing. It works, but I was hoping there was an off-the-shelf option.
Sadly, there are no off-the-shelf tools as far as I know.
I have pushed it to 8,000 by running the same app multiple times. Now I just have to ensure each app instance is reading different ranges from the DB and I'm good :)