I'm trying to export a large set of Elasticsearch query results to CSV with pandas. I'm talking about a minimum of 1 million records that need to be exported. Execution is taking far too long, and when I check the logs I see that pandas is spending most of the time appending the data.
To fetch the data I'm using the scan helper function with a chunk size of 5000, and with this I'm able to extract all of the Elasticsearch data via infinite scroll.
My code looks like:
from elasticsearch.helpers import scan
import pandas

scan_docs = pandas.DataFrame()
for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
    scan_source_data = hit["_source"]
    scan_id = hit["_id"]
    scan_doc_data = pandas.Series(scan_source_data, name=scan_id)
    scan_docs = scan_docs.append(scan_doc_data)  # this append is what the logs show as slow
scan_docs.to_csv("/tmp/scandocs.csv", sep=",")
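From what I've read, DataFrame.append copies the entire frame on every call, so the total work grows quadratically with the number of rows. One pattern I've seen suggested (a sketch only, untested on my real index; the fake hits below just stand in for what scan() yields) is to accumulate the hits in a plain list and build the DataFrame once at the end:

```python
import pandas

def export_hits_to_csv(hits, path):
    # Appending to a Python list is O(1) per hit, unlike
    # DataFrame.append, which copies the whole frame each call.
    rows = []
    ids = []
    for hit in hits:
        rows.append(hit["_source"])
        ids.append(hit["_id"])
    # Build the DataFrame once, then write the CSV in a single pass.
    docs = pandas.DataFrame(rows, index=ids)
    docs.to_csv(path, sep=",")

# Fake hits standing in for the output of scan()
fake_hits = [
    {"_id": "1", "_source": {"host": "a", "status": 200}},
    {"_id": "2", "_source": {"host": "b", "status": 404}},
]
export_hits_to_csv(fake_hits, "/tmp/scandocs.csv")
```

Would this be the idiomatic way to do it, or is there something better?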
When I checked the logs:
[2021-08-26 08:10:51,876: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/server-logs/_search?scroll=20h&size=5000 [status:200 request:0.814s]
[2021-08-26 08:13:39,213: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.653s]
[2021-08-26 08:17:50,795: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.813s]
[2021-08-26 08:23:15,433: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.554s]
[2021-08-26 08:30:25,491: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.585s]
[2021-08-26 08:38:42,866: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.483s]
As you can see, the first 5000-record chunk took 3 minutes, the next took 4 minutes, then 6, then 8, and the gap between scroll requests keeps growing as the appended DataFrame gets bigger.
Exporting around 50,000 records in one go took about an hour and a half to produce the CSV. When I tried to export 200,000 records, the gap between two consecutive scroll queries in the logs grew to an hour, and the job had been running for more than 24 hours.
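Since a million records may not even fit comfortably in memory, another option I'm considering (again just a sketch; it assumes every hit has the same fields, and the fake chunks below stand in for the 5000-hit batches from scan()) is to stream each chunk straight to the CSV with mode="a" instead of building one giant frame:

```python
import os
import pandas

def append_chunk_to_csv(chunk, path):
    # Convert one batch of hits to a small DataFrame and append it to the
    # file on disk; write the header only on the first batch so the file
    # stays a valid single-header CSV.
    df = pandas.DataFrame(
        [hit["_source"] for hit in chunk],
        index=[hit["_id"] for hit in chunk],
    )
    write_header = not os.path.exists(path)
    df.to_csv(path, sep=",", mode="a", header=write_header)

# Fake chunks standing in for batches of hits from scan()
chunks = [
    [{"_id": "1", "_source": {"host": "a", "status": 200}}],
    [{"_id": "2", "_source": {"host": "b", "status": 404}}],
]
path = "/tmp/scandocs_stream.csv"
if os.path.exists(path):
    os.remove(path)  # start fresh so the header logic works
for chunk in chunks:
    append_chunk_to_csv(chunk, path)
```

This way memory use stays bounded by one chunk, but I'm not sure whether the repeated small writes cause other problems.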
I'm a newbie to the Python world and don't know much about it, but I'm sure there must be a better way. I haven't been able to find an optimised way to resolve this. Any info on this is appreciated. Thanks.