reading multiple files into dask dataframe

Question

I want to read multiple CSV files into a single dask dataframe. For some reason, part of my original data gets lost (no clue why). What is the best way to read them all into dask? I used a for loop, but I'm not sure it's correct.

for file in os.listdir(dds_glob):
    if file.endswith('issued_processed.txt'):
        ddf = dd.read_fwf(os.path.join(dds_glob,file),
                          colspecs=cols,
                          header=None,
                          dtype=object,
                          names=names)

or should I use something like this:

dfs = delayed(pd.read_fwf)('/data/input/*issued_processed.txt',
                           colspecs=cols,
                           header=None,
                           dtype=object,
                           names=names)  
ddf = dd.from_delayed(dfs)

SultanOrazbayev · Accepted Answer · 2021-04-13 23:32:16Z

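The loop in the question reassigns `ddf` on every iteration, so only the last matching file ends up in the dataframe; that is likely where the data goes missing. To read all the files, there are at least two approaches: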

Provide dask.dataframe with a list of files; using your first snippet, it would look like:

file_list = [
    os.path.join(dds_glob,file)
    for file in os.listdir(dds_glob) if file.endswith('issued_processed.txt')
]

# other options are skipped for convenience
ddf = dd.read_fwf(file_list)

Construct the dataframe from delayed objects; using your second snippet, it would look like:

# other options are skipped, but can be included after the `file`
dfs = [delayed(pd.read_fwf)(file) for file in file_list] 
ddf = dd.from_delayed(dfs)

The first approach is something that will solve about 82% of the use-cases, but for the other cases you might need to try the second approach or something more involved.

edited Apr 13, 2021 at 23:32

answered Apr 13, 2021 at 22:42

SultanOrazbayev

2 Comments

mdurant Over a year ago

I like "other use cases" :)

Reza Mirhossein Over a year ago

Thanks, both work smoothly, though I have some other issues with the rest of the computation
