
BACKGROUND: I have a huge .txt file that I have to process for a data mining project. I've split it into many .txt files of about 100 MB each, saved them all in the same directory, and processed them this way:

from multiprocessing.dummy import Pool
import os

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue

In process(), I parse the file into a list of objects and then apply another function to that list. This is SLOWER than processing the whole file as is, but for big enough files I won't be able to run everything at once and will have to split. So I want to use threads, so that I don't have to wait for each process(filename) call to finish before starting the next.
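For context, process() is shaped roughly like this (Record, analyze, and the "split_files/" path are placeholders standing in for my real code, not the actual names):

import os

pathToFile = "split_files/"  # placeholder: directory holding the 100 MB chunks

class Record:
    # placeholder for the real object each line is parsed into
    def __init__(self, line):
        self.fields = line.rstrip("\n").split("\t")

def analyze(records):
    # placeholder for the second function applied to the parsed list
    return len(records)

def process(filename):
    # parse one .txt chunk into a list of objects, then apply another function
    records = []
    with open(os.path.join(pathToFile, filename)) as f:
        for line in f:
            records.append(Record(line))
    return analyze(records)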

How can I apply it? I've checked this, but I didn't understand how to apply it to my code...

Any help would be appreciated. I looked here to see how to do this. What I've tried:

pool = Pool(6)
futures = []
for x in range(6):
    futures.append(pool.apply_async(process, (filename,)))

Unfortunately, I realized it will only process the first 6 text files, or will it not? How can I make it work so that as soon as a thread finishes, it is assigned another text file and starts running?
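In other words, I think the pattern I'm after is something like the following, where every filename is submitted and the 6 workers just pick up the next file as soon as they are free (my guess, not tested):

from multiprocessing.dummy import Pool
import os

pool = Pool(6)
futures = []
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        # submit every file; the pool hands the next file to whichever
        # worker becomes free
        futures.append(pool.apply_async(process, (filename,)))

results = [f.get() for f in futures]  # blocks until all files are done
pool.close()
pool.join()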

EDIT:

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue
  • Pass all your filenames in the loop. 6 means that 6 files will be processed at the same time, but I'm not sure you'll gain speed because of the Python GIL and threads. You should look at multiprocessing instead. Commented Feb 1, 2017 at 10:22
  • Are you talking about thread pools or process pools? Commented Feb 1, 2017 at 10:22
  • @roganjosh, it is the same program, so it has to be threads, doesn't it? Commented Feb 1, 2017 at 10:23
  • @Jean-FrançoisFabre from multiprocessing.dummy import Pool Commented Feb 1, 2017 at 10:23
  • No, you can spawn multiple processes using the multiprocessing module. As was said, the GIL in Python means that only one thread can ever execute Python code at once, so multithreading will not lead to any increase in speed. Commented Feb 1, 2017 at 10:24

1 Answer


First, using multiprocessing.dummy will only give you a speed increase if your problem is IO bound, i.e. when reading the files is the main bottleneck. For CPU-intensive tasks (processing the files is the bottleneck) it won't help; in that case you should use "real" multiprocessing.
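Switching between the two is just a matter of which module you import Pool from; the interface is the same (a rough sketch, assuming the process function from your question):

# Thread pool: enough when reading the files (IO) is the bottleneck.
# from multiprocessing.dummy import Pool

# Process pool: use this when parsing/processing (CPU) is the bottleneck,
# since the GIL keeps threads from running Python code in parallel.
from multiprocessing import Pool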

The problem you describe seems better suited to one of the map functions of Pool:

from multiprocessing import Pool
import os

# collect every .txt file in the directory
files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

pool = Pool(6)                      # 6 worker processes
results = pool.map(process, files)  # blocks until every file is processed
pool.close()

This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
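If you would rather handle each result as soon as its file is done, instead of waiting for the whole list, imap_unordered works with the same setup; a rough sketch:

from multiprocessing import Pool
import os

files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

pool = Pool(6)
# results are yielded one by one, in whatever order the workers finish
for result in pool.imap_unordered(process, files):
    print(result)
pool.close()
pool.join()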


7 Comments

Nice, simple answer. Don't you have to close() and join() the pool to access the results?
I don't have a list of files. I am using for filename in os.list... to access all the .txt files in a specific folder.
@roganjosh no, you don't have to use join() when using map() because when it returns all workers have already completed their tasks. Calling close() allows the workers to terminate, so that's good practice, thx for the hint.
@HerthaBSCfan files is a list comprehension that is giving you a list of file names.
@roganjosh :( my program is not finishing now. Without the pool it runs in 20 minutes. With the pool it's been running for an hour and it's still running...
