
I have hundreds of CSV files, each storing the same number of columns. Instead of reading them one at a time, I want to read them with multiprocessing.

For representation I have created 4 files: Book1.csv, Book2.csv, Book3.csv, and Book4.csv; each stores the numbers 1 through 5 in column A, starting at row 1.
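For reference, a minimal sketch that generates these sample files (it assumes the I:\Sims folder already exists and that the files have no header row, as described above):

import os
import pandas as pd

loc = r'I:\Sims'
for i in range(1, 5):
    # Five numbers in a single column, no header, starting at row 1
    pd.DataFrame([1, 2, 3, 4, 5]).to_csv(
        os.path.join(loc, 'Book%d.csv' % i), index=False, header=False)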

I am trying the following:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    p = multiprocessing.Pool()

    for f in fname:
        p.apply_async(process, [f])

    p.close()
    p.join()

I got the idea for the above code from the link.

But the above code is not producing the desired result, which I expected to be:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5

Edit: I want to load each of the files in a separate process and combine the file contents. Since I have hundreds of files to load and combine, I was hoping to make the process faster by loading 4 files at a time (my PC has 4 processors).
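For comparison, a minimal single-process version of the same task (sequential reads plus pd.concat) might look like the sketch below; the glob pattern is an assumption, and since the work is largely I/O-bound this is a useful baseline to time any multiprocessing approach against:

import glob
import os
import pandas as pd

loc = r'I:\Sims'

# Read every CSV sequentially (header=None because the files have no header row)
frames = [pd.read_csv(f, header=None)
          for f in sorted(glob.glob(os.path.join(loc, '*.csv')))]
combined = pd.concat(frames, ignore_index=True)
print(combined[0].tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...]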

  • I don't see that your code is producing any output, let alone the expected output. What are you trying to achieve? How do you want to process the data? Commented Nov 29, 2016 at 23:09
  • If working with a big amount of tabular data is part of your frequent workflow, you could have a look at dask: dask.pydata.org/en/latest Commented Nov 30, 2016 at 0:06
  • Your code discards the dataframes after they are returned to the parent process. You could replace the for loop with dataframes = pool.map(process, fname) and get them in a list. Considering the operation is I/O bound and you add overhead passing the dataframe from child to parent, you may find this takes longer than just reading them in 1 process. Commented Nov 30, 2016 at 1:03
  • @tdelaney What do you mean by "reading them in 1 process"? Commented Nov 30, 2016 at 2:03

1 Answer


Try this:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    # header=None: the sample files store bare numbers with no header row
    return pd.read_csv(file, header=None)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    

    with multiprocessing.Pool(5) as p:  # create a pool of 5 worker processes
        result = p.map(process, fname)  # blocks until every file has been read
    print(len(result))  # 4 -- one DataFrame per file
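Since p.map returns the DataFrames in the same order as fname, combining them into the single sequence the question expects is one pd.concat call away (this sketch assumes the header=None read shown above, so the values sit in column 0):

    combined = pd.concat(result, ignore_index=True)
    print(combined[0].tolist())
    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]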