
BACKGROUND: I have a huge .txt file that I have to process for a data mining project. I've split it into many .txt files of about 100 MB each, saved them all in the same directory, and processed them this way:

from multiprocessing.dummy import Pool
import os

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue

In process(), I parse the file into a list of objects and then apply another function to that list. This is SLOWER than processing the whole file as is, but for big enough files I won't be able to run everything at once and will have to split. So I want to use threads, so that I don't have to wait for each process(filename) call to finish before starting the next.
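For context, process() is shaped roughly like this (Record, analyze, and the "split_files/" path are placeholders standing in for my real code, not the actual names):

import os

pathToFile = "split_files/"  # placeholder: directory holding the 100 MB chunks

class Record:
    # placeholder for the real object each line is parsed into
    def __init__(self, line):
        self.fields = line.rstrip("\n").split("\t")

def analyze(records):
    # placeholder for the second function applied to the parsed list
    return len(records)

def process(filename):
    # parse one .txt chunk into a list of objects, then apply another function
    records = []
    with open(os.path.join(pathToFile, filename)) as f:
        for line in f:
            records.append(Record(line))
    return analyze(records)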

How can I apply it? I've checked this, but I didn't understand how to apply it to my code...

Any help would be appreciated. I looked here to see how to do this. What I've tried:

pool = Pool(6)
futures = []
for x in range(6):
    futures.append(pool.apply_async(process, (filename,)))

Unfortunately, I realized it will only process the first 6 text files, or will it not? How can I make it work so that as soon as a thread finishes, it is assigned another text file and starts running?
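In other words, I think the pattern I'm after is something like the following, where every filename is submitted and the 6 workers just pick up the next file as soon as they are free (my guess, not tested):

from multiprocessing.dummy import Pool
import os

pool = Pool(6)
futures = []
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        # submit every file; the pool hands the next file to whichever
        # worker becomes free
        futures.append(pool.apply_async(process, (filename,)))

results = [f.get() for f in futures]  # blocks until all files are done
pool.close()
pool.join()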

EDIT:

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue
  • Pass all your filenames in the loop. 6 means that 6 files will be processed at the same time, but I'm not sure you'll gain speed because of the Python GIL and threads. You should look at multiprocessing instead. Commented Feb 1, 2017 at 10:22
  • Are you talking about thread pools or process pools? Commented Feb 1, 2017 at 10:22
  • @roganjosh, it is the same program, so it has to be threads, doesn't it? Commented Feb 1, 2017 at 10:23
  • @Jean-FrançoisFabre from multiprocessing.dummy import Pool Commented Feb 1, 2017 at 10:23
  • No, you can spawn multiple processes using the multiprocessing module. As was said, the GIL in Python means that only one thread can ever execute Python code at once, so multithreading will not lead to any increase in speed. Commented Feb 1, 2017 at 10:24

1 Answer


First, using multiprocessing.dummy will only give you a speed increase if your problem is IO bound, i.e. when reading the files is the main bottleneck. For CPU-intensive tasks (processing the files is the bottleneck) it won't help; in that case you should use "real" multiprocessing.
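Switching between the two is just a matter of which module you import Pool from; the interface is the same (a rough sketch, assuming the process function from your question):

# Thread pool: enough when reading the files (IO) is the bottleneck.
# from multiprocessing.dummy import Pool

# Process pool: use this when parsing/processing (CPU) is the bottleneck,
# since the GIL keeps threads from running Python code in parallel.
from multiprocessing import Pool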

The problem you describe seems better suited to one of the map functions of Pool:

from multiprocessing import Pool
import os

# collect every .txt file in the directory
files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

pool = Pool(6)                      # 6 worker processes
results = pool.map(process, files)  # blocks until every file is processed
pool.close()

This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
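If you would rather handle each result as soon as its file is done, instead of waiting for the whole list, imap_unordered works with the same setup; a rough sketch:

from multiprocessing import Pool
import os

files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

pool = Pool(6)
# results are yielded one by one, in whatever order the workers finish
for result in pool.imap_unordered(process, files):
    print(result)
pool.close()
pool.join()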


7 Comments

Nice, simple answer. Don't you have to close() and join() the pool to access the results?
I don't have a list of files. I am using for filename in os.list... to access all the .txt files in a specific folder.
@roganjosh no, you don't have to use join() when using map() because when it returns all workers have already completed their tasks. Calling close() allows the workers to terminate, so that's good practice, thx for the hint.
@HerthaBSCfan files is a list comprehension that is giving you a list of file names.
@roganjosh :( my program is not finishing now. Without the pool it runs in 20 minutes. With the pool it's been running for an hour and it's still running...
