
I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge (SQL-like) them into a Dask dataframe. Now I am trying to write the merged result into a single CSV. I used compute() on the Dask dataframe to collect the data into a single pandas dataframe and then called to_csv, but compute() is slow at reading data across all partitions. I tried calling to_csv directly on the Dask dataframe and it created multiple .part files (I didn't try merging those .part files into a CSV). Is there an alternative way to get the Dask dataframe into a single CSV, or a parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.

Thanks

  • I'd guess that calling compute() is slow not because reading in the data takes long, but because merging them might require a shuffle. The default scheduler might not be ideal for this. Did you try other schedulers? (See the sketch after these comments.) Commented Mar 23, 2017 at 10:15
  • I haven't explored other schedulers. I will try now. Thanks a lot Commented Mar 23, 2017 at 13:39
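
For reference, a minimal sketch of switching schedulers (file names and the join key are placeholders; scheduler= is the keyword in current Dask releases, older versions used get= instead):

import dask.dataframe as dd

left = dd.read_csv('table_a-*.csv')    # hypothetical inputs
right = dd.read_csv('table_b-*.csv')
merged = left.merge(right, on='key')   # the shuffle-heavy step

# try the multiprocessing scheduler instead of the default threaded one
result = merged.compute(scheduler='processes')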

2 Answers


Dask.dataframe will not write to a single CSV file. As you mention, it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into an in-memory pandas dataframe, which might fill up memory.
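
For reference, the two patterns look roughly like this (df standing in for the merged Dask dataframe, paths are placeholders):

# one CSV per partition: out-0.csv, out-1.csv, ...
df.to_csv('out-*.csv')

# collect everything into a single pandas dataframe first
# (must fit in RAM), then write one file
df.compute().to_csv('out.csv')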

One option is to avoid pandas and Dask altogether and just read the bytes from the multiple files and dump them into another file:

with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to skip a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # lines already end in '\n'

If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV data into in-memory dataframes, which will take a while and which you don't need. If, on the other hand, you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write out many CSV files, and then use the trick above to merge them afterwards.
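
A rough sketch of that workflow, assuming placeholder file names and a placeholder join column:

import glob
import dask.dataframe as dd

df = dd.read_csv('input-*.csv')       # hypothetical inputs
other = dd.read_csv('lookup-*.csv')
merged = df.merge(other, on='id')

# write one CSV per partition without ever collecting into memory
merged.to_csv('merged-part-*.csv', index=False)

# then concatenate the parts with plain file I/O, keeping one header
# (the order of the parts does not matter here)
part_files = glob.glob('merged-part-*.csv')
with open('merged.csv', 'w') as outfile:
    for i, path in enumerate(part_files):
        with open(path) as infile:
            header = next(infile)
            if i == 0:
                outfile.write(header)
            outfile.writelines(infile)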

You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
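
For example, with Parquet (requires pyarrow or fastparquet to be installed; the path is a placeholder):

# keeps the data partitioned, and is usually much faster
# to write and re-read than CSV
merged.to_parquet('merged.parquet')

# read it back later with dask or pandas
df = dd.read_parquet('merged.parquet')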


2 Comments

I merged the files using an inner join on a particular column with dask. I will try reading the multiple .part files as you mentioned in the code snippet above instead of using compute() directly. I will also explore other formats. Thanks a lot
You can probably speed this up using 3 threads, each working on a separate file, with a shared queue handing the data to a single writer. A rough sketch follows.
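
For what it's worth, a sketch of that idea with hypothetical file names, assuming the row order of the combined file does not matter: reader threads push lines onto a shared queue and a single writer drains it, so only one thread ever touches the output file. On a single HDD the work is largely I/O-bound, so the gain may be small.

import threading
import queue

filenames = ['part-0.csv', 'part-1.csv', 'part-2.csv']  # hypothetical inputs
out_filename = 'combined.csv'                           # hypothetical output

line_queue = queue.Queue(maxsize=10000)
SENTINEL = object()  # tells the writer that one reader has finished

def reader(path):
    with open(path, 'r') as infile:
        next(infile, None)  # skip the header if each file has one
        for line in infile:
            line_queue.put(line)
    line_queue.put(SENTINEL)

def writer(n_readers):
    # note: the combined file has no header in this sketch
    finished = 0
    with open(out_filename, 'w') as outfile:
        while finished < n_readers:
            item = line_queue.get()
            if item is SENTINEL:
                finished += 1
            else:
                outfile.write(item)

readers = [threading.Thread(target=reader, args=(name,)) for name in filenames]
writer_thread = threading.Thread(target=writer, args=(len(readers),))
writer_thread.start()
for t in readers:
    t.start()
for t in readers:
    t.join()
writer_thread.join()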

As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
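
In context, a minimal sketch with placeholder paths, assuming Dask >= 2.4.0:

import dask.dataframe as dd

dask_df = dd.read_csv('input-*.csv')            # hypothetical inputs
dask_df.to_csv('merged.csv', single_file=True)  # one output file instead of many .part files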

Like @mrocklin said, I recommend using other file formats.
