I'm trying to export a large set of Elasticsearch query results to CSV with pandas. I'm talking about a minimum of 1 million records that need to be exported. Execution is taking far too long, and when I check the logs I see that pandas is spending most of the time appending the data.
To fetch the data I'm using the scan helper function with a chunk size of 5000, and with this I'm able to extract all of the Elasticsearch data via infinite scroll.
My code looks like:
from elasticsearch.helpers import scan
import pandas

scan_docs = pandas.DataFrame()
for hit in scan(elastic_client, index=index, query=query, scroll='20h', clear_scroll=True, size=5000):
    scan_source_data = hit["_source"]
    scan_id = hit["_id"]
    scan_doc_data = pandas.Series(scan_source_data, name=scan_id)
    scan_docs = scan_docs.append(scan_doc_data)  # this append is what the logs show as slow
scan_docs.to_csv("/tmp/scandocs.csv", sep=",")
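From what I've read, DataFrame.append copies the entire frame on every call, so the total work grows quadratically with the number of rows. One pattern I've seen suggested (a sketch only, untested on my real index; the fake hits below just stand in for what scan() yields) is to accumulate the hits in a plain list and build the DataFrame once at the end:

```python
import pandas

def export_hits_to_csv(hits, path):
    # Appending to a Python list is O(1) per hit, unlike
    # DataFrame.append, which copies the whole frame each call.
    rows = []
    ids = []
    for hit in hits:
        rows.append(hit["_source"])
        ids.append(hit["_id"])
    # Build the DataFrame once, then write the CSV in a single pass.
    docs = pandas.DataFrame(rows, index=ids)
    docs.to_csv(path, sep=",")

# Fake hits standing in for the output of scan()
fake_hits = [
    {"_id": "1", "_source": {"host": "a", "status": 200}},
    {"_id": "2", "_source": {"host": "b", "status": 404}},
]
export_hits_to_csv(fake_hits, "/tmp/scandocs.csv")
```

Would this be the idiomatic way to do it, or is there something better?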
When I checked the logs:
[2021-08-26 08:10:51,876: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/server-logs/_search?scroll=20h&size=5000 [status:200 request:0.814s]
[2021-08-26 08:13:39,213: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.653s]
[2021-08-26 08:17:50,795: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.813s]
[2021-08-26 08:23:15,433: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.554s]
[2021-08-26 08:30:25,491: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.585s]
[2021-08-26 08:38:42,866: INFO/ForkPoolWorker-1] POST https://search-52bt6xrhhsi.ap-south-1.es.amazonaws.com:443/_search/scroll [status:200 request:0.483s]
As you can see, the first 5000-record chunk took 3 minutes, the next took 4 minutes, then 6, then 8, and the gap between scroll requests keeps growing as the appended DataFrame gets bigger.
Exporting around 50,000 records in one go took about an hour and a half to produce the CSV. When I tried to export 200,000 records, the gap between two consecutive scroll queries in the logs grew to an hour, and the job had been running for more than 24 hours.
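Since a million records may not even fit comfortably in memory, another option I'm considering (again just a sketch; it assumes every hit has the same fields, and the fake chunks below stand in for the 5000-hit batches from scan()) is to stream each chunk straight to the CSV with mode="a" instead of building one giant frame:

```python
import os
import pandas

def append_chunk_to_csv(chunk, path):
    # Convert one batch of hits to a small DataFrame and append it to the
    # file on disk; write the header only on the first batch so the file
    # stays a valid single-header CSV.
    df = pandas.DataFrame(
        [hit["_source"] for hit in chunk],
        index=[hit["_id"] for hit in chunk],
    )
    write_header = not os.path.exists(path)
    df.to_csv(path, sep=",", mode="a", header=write_header)

# Fake chunks standing in for batches of hits from scan()
chunks = [
    [{"_id": "1", "_source": {"host": "a", "status": 200}}],
    [{"_id": "2", "_source": {"host": "b", "status": 404}}],
]
path = "/tmp/scandocs_stream.csv"
if os.path.exists(path):
    os.remove(path)  # start fresh so the header logic works
for chunk in chunks:
    append_chunk_to_csv(chunk, path)
```

This way memory use stays bounded by one chunk, but I'm not sure whether the repeated small writes cause other problems.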
I'm a newbie to the Python world and don't know much about it, but I'm sure there must be a better way. I haven't been able to find an optimised way to resolve this. Any info on this is appreciated. Thanks.