
I have hundreds of CSV files, each storing the same number of columns. Instead of reading them one at a time, I want to read them with multiprocessing.

For representation I have created 4 files: Book1.csv, Book2.csv, Book3.csv, and Book4.csv; each stores the numbers 1 through 5 in column A, starting at row 1.
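For reference, a minimal sketch that generates these sample files (it assumes the I:\Sims folder already exists and that the files have no header row, as described above):

import os
import pandas as pd

loc = r'I:\Sims'
for i in range(1, 5):
    # Five numbers in a single column, no header, starting at row 1
    pd.DataFrame([1, 2, 3, 4, 5]).to_csv(
        os.path.join(loc, 'Book%d.csv' % i), index=False, header=False)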

I am trying the following:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    p = multiprocessing.Pool()

    for f in fname:
        p.apply_async(process, [f])

    p.close()
    p.join()

I got the idea for the above code from the link.

But the above code is not producing the desired result, which I expected to be:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5

Edit: I want to load each of the files in a separate process and combine the file contents. Since I have hundreds of files to load and combine, I was hoping to make the process faster by loading 4 files at a time (my PC has 4 processors).
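For comparison, a minimal single-process version of the same task (sequential reads plus pd.concat) might look like the sketch below; the glob pattern is an assumption, and since the work is largely I/O-bound this is a useful baseline to time any multiprocessing approach against:

import glob
import os
import pandas as pd

loc = r'I:\Sims'

# Read every CSV sequentially (header=None because the files have no header row)
frames = [pd.read_csv(f, header=None)
          for f in sorted(glob.glob(os.path.join(loc, '*.csv')))]
combined = pd.concat(frames, ignore_index=True)
print(combined[0].tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...]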

  • I don't see that your code is producing any output, let alone the expected output. What are you trying to achieve? How do you want to process the data? Commented Nov 29, 2016 at 23:09
  • If working with a big amount of tabular data is part of your frequent workflow, you could have a look at dask: dask.pydata.org/en/latest Commented Nov 30, 2016 at 0:06
  • Your code discards the dataframes after they are returned to the parent process. You could replace the for loop with dataframes = pool.map(process, fname) and get them in a list. Considering the operation is I/O bound and you add overhead passing the dataframe from child to parent, you may find this takes longer than just reading them in 1 process. Commented Nov 30, 2016 at 1:03
  • @tdelaney What do you mean by "reading them in 1 process"? Commented Nov 30, 2016 at 2:03

1 Answer


Try this:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    # header=None: the sample files store bare numbers with no header row
    return pd.read_csv(file, header=None)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    

    with multiprocessing.Pool(5) as p:  # create a pool of 5 worker processes
        result = p.map(process, fname)  # blocks until every file has been read
    print(len(result))  # 4 -- one DataFrame per file
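Since p.map returns the DataFrames in the same order as fname, combining them into the single sequence the question expects is one pd.concat call away (this sketch assumes the header=None read shown above, so the values sit in column 0):

    combined = pd.concat(result, ignore_index=True)
    print(combined[0].tolist())
    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]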