I have an instance of Elasticsearch running on a server. When I try to index a huge corpus using multiprocessing, I get a lot of timeout errors. It seems that Elasticsearch can handle only a small number of concurrent requests. I've followed the configuration suggested on the Elasticsearch website. Are there any suggestions on what I should do to increase indexing performance in a multiprocessing setting? The index that I'm adding documents to has one shard.
We have very few details on the configuration, and the bottleneck can come from many places. As a first change, set index.refresh_interval to -1 for the initial indexing (and change it back after the first ingestion). But as you are working on localhost, I guess you are doing a lot of I/O on the same HDD, or your RAM is full and your computer is swapping. – Jaycreation, Oct 8, 2020 at 5:03
1 Answer
There are several things you can do.
First, set refresh_interval. The refresh interval controls how often newly indexed documents become visible to search. If you can tolerate the delay, set it to at least 30 seconds, or to -1 to disable refreshes while loading (and restore it afterwards). I have read that this can increase indexing performance by about 70%.
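As a rough illustration of the refresh_interval change, here is a minimal sketch that builds the update-settings request with only the Python standard library. The address http://localhost:9200, the index name my-index, and the helper names are assumptions for the example, not details from the question; adjust them (and add authentication) for your cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: where your node listens
INDEX = "my-index"                # hypothetical index name


def build_settings_request(interval, es_url=ES_URL, index=INDEX):
    """Build a PUT /<index>/_settings request that sets refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_refresh_interval(interval, es_url=ES_URL, index=INDEX, timeout=30):
    """Send the settings update and return the decoded JSON response."""
    request = build_settings_request(interval, es_url, index)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Typical use around a large load (needs a reachable cluster):
#   set_refresh_interval("-1")   # disable refreshes while indexing
#   ... index the corpus ...
#   set_refresh_interval("1s")   # restore a normal interval afterwards
```

Remember to restore the interval once the load finishes, otherwise new documents will not show up in searches until a refresh is triggered some other way.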
The second thing you can try is to use the bulk API instead of indexing documents one request at a time. Each worker process then sends far fewer, larger requests.
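To make the bulk suggestion concrete, here is a hedged sketch of how each worker could batch its documents and post them to the _bulk endpoint, again using only the standard library. The URL, index name, batch size, and helper names (chunked, bulk_body, send_bulk) are illustrative assumptions; the official Python client also ships bulk helpers that you may prefer.

```python
import json
import urllib.request
from itertools import islice

ES_URL = "http://localhost:9200"  # assumption: where your node listens


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_body(docs, index):
    """Encode docs as newline-delimited JSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def send_bulk(docs, index, es_url=ES_URL, timeout=60):
    """POST one batch to /_bulk and raise if any item in it failed."""
    request = urllib.request.Request(
        url=f"{es_url}/_bulk",
        data=bulk_body(docs, index),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    if result.get("errors"):
        raise RuntimeError("some documents in the batch failed to index")
    return result


# Typical use inside each worker (needs a reachable cluster):
#   for batch in chunked(my_documents, 1000):   # batch size is a guess; tune it
#       send_bulk(batch, "my-index")
```

The right batch size depends on document size and hardware, so it is worth starting with a few hundred to a few thousand documents per request and measuring.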
Disabling swap can improve performance in some cases.
Another option is to increase the amount of RAM (the JVM heap) that you have assigned to Elasticsearch.
Finally, increasing the share of the heap used for indexing (the indexing buffer, set with indices.memory.index_buffer_size) can increase write performance. The default is 10 percent of the total heap.
I hope these points help.