
I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge (SQL-like) them into a Dask dataframe. Now I am trying to write the merged result into a single CSV. I used compute() on the Dask dataframe to collect the data into a single pandas dataframe and then called to_csv, but compute() is slow at reading data across all partitions. I tried calling to_csv directly on the Dask dataframe and it created multiple .part files (I didn't try merging those .part files into a CSV). Is there an alternative way to get the Dask dataframe into a single CSV, or a parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.

Thanks

  • I'd guess that calling compute() is slow not because reading in the data takes long, but because merging them might require a shuffle. The default scheduler might not be ideal for this. Did you try other schedulers? (See the sketch after these comments.) Commented Mar 23, 2017 at 10:15
  • I haven't explored other schedulers. I will try now. Thanks a lot Commented Mar 23, 2017 at 13:39
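
For reference, a minimal sketch of switching schedulers (file names and the join key are placeholders; scheduler= is the keyword in current Dask releases, older versions used get= instead):

import dask.dataframe as dd

left = dd.read_csv('table_a-*.csv')    # hypothetical inputs
right = dd.read_csv('table_b-*.csv')
merged = left.merge(right, on='key')   # the shuffle-heavy step

# try the multiprocessing scheduler instead of the default threaded one
result = merged.compute(scheduler='processes')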

2 Answers


Dask.dataframe will not write to a single CSV file. As you mention, it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into an in-memory pandas dataframe, which might fill up memory.
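
For reference, the two patterns look roughly like this (df standing in for the merged Dask dataframe, paths are placeholders):

# one CSV per partition: out-0.csv, out-1.csv, ...
df.to_csv('out-*.csv')

# collect everything into a single pandas dataframe first
# (must fit in RAM), then write one file
df.compute().to_csv('out.csv')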

One option is to avoid pandas and Dask altogether and just read the bytes from the multiple files and dump them into another file:

with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to skip a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # lines already end in '\n'

If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV data into in-memory dataframes, which will take a while and which you don't need. If, on the other hand, you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write out many CSV files, and then use the trick above to merge them afterwards.
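
A rough sketch of that workflow, assuming placeholder file names and a placeholder join column:

import glob
import dask.dataframe as dd

df = dd.read_csv('input-*.csv')       # hypothetical inputs
other = dd.read_csv('lookup-*.csv')
merged = df.merge(other, on='id')

# write one CSV per partition without ever collecting into memory
merged.to_csv('merged-part-*.csv', index=False)

# then concatenate the parts with plain file I/O, keeping one header
# (the order of the parts does not matter here)
part_files = glob.glob('merged-part-*.csv')
with open('merged.csv', 'w') as outfile:
    for i, path in enumerate(part_files):
        with open(path) as infile:
            header = next(infile)
            if i == 0:
                outfile.write(header)
            outfile.writelines(infile)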

You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
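
For example, with Parquet (requires pyarrow or fastparquet to be installed; the path is a placeholder):

# keeps the data partitioned, and is usually much faster
# to write and re-read than CSV
merged.to_parquet('merged.parquet')

# read it back later with dask or pandas
df = dd.read_parquet('merged.parquet')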


2 Comments

I merged the files using an inner join on a particular column with dask. I will try reading the multiple .part files as you mentioned in the code snippet above instead of using compute() directly. I will also explore other formats. Thanks a lot
You can probably speed this up using 3 threads, each working on a separate file, with a shared queue handing the data to a single writer. A rough sketch follows.
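
For what it's worth, a sketch of that idea with hypothetical file names, assuming the row order of the combined file does not matter: reader threads push lines onto a shared queue and a single writer drains it, so only one thread ever touches the output file. On a single HDD the work is largely I/O-bound, so the gain may be small.

import threading
import queue

filenames = ['part-0.csv', 'part-1.csv', 'part-2.csv']  # hypothetical inputs
out_filename = 'combined.csv'                           # hypothetical output

line_queue = queue.Queue(maxsize=10000)
SENTINEL = object()  # tells the writer that one reader has finished

def reader(path):
    with open(path, 'r') as infile:
        next(infile, None)  # skip the header if each file has one
        for line in infile:
            line_queue.put(line)
    line_queue.put(SENTINEL)

def writer(n_readers):
    # note: the combined file has no header in this sketch
    finished = 0
    with open(out_filename, 'w') as outfile:
        while finished < n_readers:
            item = line_queue.get()
            if item is SENTINEL:
                finished += 1
            else:
                outfile.write(item)

readers = [threading.Thread(target=reader, args=(name,)) for name in filenames]
writer_thread = threading.Thread(target=writer, args=(len(readers),))
writer_thread.start()
for t in readers:
    t.start()
for t in readers:
    t.join()
writer_thread.join()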

As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
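
In context, a minimal sketch with placeholder paths, assuming Dask >= 2.4.0:

import dask.dataframe as dd

dask_df = dd.read_csv('input-*.csv')            # hypothetical inputs
dask_df.to_csv('merged.csv', single_file=True)  # one output file instead of many .part files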

Like @mrocklin said, I recommend using other file formats.
