
I am trying to have a parent Python script send variables to a child script, to help me speed up and automate video analysis.

I am now using the subprocess.Popen() call to start up 6 instances of a child script, but cannot find a way to pass variables and modules already imported in the parent to the child. For example, the parent file would have:

import os
import subprocess
import sys

parent_dir = os.path.realpath(sys.argv[0])
subprocess.Popen([sys.executable, 'analysis.py'])

but then import os, import sys, import subprocess, and parent_dir all have to be repeated in "analysis.py". Is there a way to pass them to the child?

In short, what I am trying to achieve is: I have a folder with a couple hundred video files. I want the parent Python script to list the video files and start up to 6 parallel instances of an analysis script, each analysing one video file. Once there are no more files to be analysed, the parent script stops.

1 Answer


The simple answer here is: don't use subprocess.Popen, use multiprocessing.Process. Or, better yet, multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor.

With subprocess, your program's Python interpreter doesn't know anything about the subprocess at all; for all it knows, the child process is running Doom. So there's no way to directly share information with it.* But with multiprocessing, Python controls launching the subprocess and getting everything set up so that you can share data as conveniently as possible.

Unfortunately "as conveniently as possible" still isn't 100% as convenient as all being in one process. But what you can do is usually good enough. Read the section on Exchanging objects between processes and the following few sections; hopefully one of those mechanisms will be exactly what you need.

But, as I implied at the top, in most cases you can make it even simpler, by using a pool. Instead of thinking about "running 6 processes and sharing data with them", just think about it as "running a bunch of tasks on a pool of 6 processes". A task is basically just a function—it takes arguments, and returns a value. If the work you want to parallelize fits into that model—and it sounds like your work does—life is as simple as could be. For example:

import multiprocessing
import os
import sys

import analysis

parent_dir = os.path.realpath(sys.argv[0])

# Assuming the folder of video files is passed on the command line,
# e.g.  python parent.py /path/to/videos
folderpath = sys.argv[1]
paths = [os.path.join(folderpath, file)
         for file in os.listdir(folderpath)]

if __name__ == '__main__':   # so the pool's child processes don't re-run this block on import
    with multiprocessing.Pool(processes=6) as pool:
        results = pool.map(analysis.analyze, paths)

If you're using Python 3.2 or earlier (including 2.7), you can't use a Pool in a with statement. I believe you want this:**

pool = multiprocessing.Pool(processes=6)
try:
    results = pool.map(analysis.analyze, paths)
finally:
    pool.close()
    pool.join()

This will start up 6 processes,*** then tell the first one to do analysis.analyze(paths[0]), the second to do analysis.analyze(paths[1]), etc. As soon as any of the processes finishes, the pool will give it the next path to work on. When they're all finished, you get back a list of all the results.****

Of course this means that the top-level code that lived in analysis.py has to be moved into a function, def analyze(path):, so you can call it. Or, even better, you can move that function into the main script instead of a separate file, if you really want to save that import line.
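
Concretely, analysis.py would end up shaped something like this; the body here is just a stand-in for your real per-video code:

# analysis.py
import os

def analyze(path):
    # Your ~150 lines of per-video analysis go here, working on the
    # single file at `path`. Whatever you return ends up in the
    # parent's `results` list. (Returning the file size is only a
    # placeholder so this stub actually runs.)
    return os.path.getsize(path)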


* You can still indirectly share information by, e.g., marshaling it into some interchange format like JSON and passing it via the stdin/stdout pipes, a file, a shared memory segment, a socket, etc., but multiprocessing effectively wraps that up for you to make it a whole lot easier.
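
If you did want to roll that yourself with subprocess, it would look roughly like this sketch; the file names and JSON keys are made up for illustration:

# parent.py: send the child its arguments as JSON on stdin, read a JSON result back
import json
import subprocess
import sys

proc = subprocess.Popen([sys.executable, 'analysis.py'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(json.dumps({'path': 'video1.mp4'}).encode())
print(json.loads(out))

# analysis.py: read the arguments, do the work, write a result
import json
import sys

args = json.load(sys.stdin)
json.dump({'path': args['path'], 'frames': 0}, sys.stdout)

That is already noticeably more plumbing than the Pool version above, for a single child and a single argument.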

** There are different ways to shut a pool down, and you can also choose whether or not to join it immediately, so you really should read up on the details at some point. But when all you're doing is calling pool.map, it really doesn't matter; the pool is guaranteed to shut down and be ready to join nearly instantly by the time the map call returns.

*** I'm not sure why you wanted 6; most machines have 4, 8, or 16 cores, not 6; why not use them all? The best thing to do is usually to just leave out the processes=6 entirely and let multiprocessing ask your OS how many cores to use, which means it'll still run at full speed on your new machine with twice as many cores that you'll buy next year.
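
In code, that just means leaving the argument out; a tiny self-contained sketch, with a toy function standing in for the real analysis:

import multiprocessing
import os

def toy(x):                     # stand-in for analysis.analyze
    return x * x

if __name__ == '__main__':
    print(os.cpu_count())       # the worker count Pool() will default to
    with multiprocessing.Pool() as pool:        # no processes= argument
        print(pool.map(toy, range(10)))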

**** This is slightly oversimplified; usually the pool will give the first process a batch of files, not one at a time, to save a bit of overhead, and you can manually control the batching if you need to optimize things or sequence them more carefully. But usually you don't care, and this oversimplification is fine.
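
For completeness, the knob for that is the optional chunksize argument to map (and to imap/imap_unordered, which also hand back each result as soon as it finishes); a sketch using the same names as above, with placeholder paths:

import multiprocessing

import analysis                 # your module with analyze(path), as above

if __name__ == '__main__':
    paths = ['video1.mp4', 'video2.mp4']        # your real list of paths
    with multiprocessing.Pool(processes=6) as pool:
        # chunksize=1 hands out one path at a time instead of a batch;
        # imap_unordered yields each result as soon as it's ready.
        for result in pool.imap_unordered(analysis.analyze, paths, chunksize=1):
            print(result)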


6 Comments

Wow, thanks for the very fast and elaborate answer. I have not been working with Python for too long, which is why I needed help. My analysis script is about 50 lines of code to set things up and 150 lines for the actual analysis of the video. I could simply move the 50 lines to the main script, and understand from you that I can create a function from the 150 lines to analyse the video and then run the multiprocessing command to simply run multiple of these loops at the same time, is that right? And whenever one video is finished, it will go to the next file until it reaches the end of paths?
Regarding the number of processes, I simply try to find the optimal number of processes to get all videos analysed in the shortest amount of time. I have a powerful Mac mini with 4 cores, and by testing running multiple of the analysis scripts in separate terminal windows I found 6 to work best in terms of total number of frames analysed per unit of time. However, you have a good point about the number of processes. How would I be able to let 'multiprocessing' decide how many processes I run simultaneously? Thanks again for your great help!
@JolJols: First, for your second question: Just do multiprocessing.Pool() instead of multiprocessing.Pool(processes=6), and it'll ask the OS, which in practice just means it calls os.cpu_count(), which on OS X I think gives the same result as sysctl hw.ncpu, and will probably be 4. My guess is that 6 beats 4 because you're spending a significant but not overwhelming amount of time on I/O rather than CPU (so you can have 4 cores crunching numbers while 2 of the processes are blocked on disk reads). If it makes a big enough difference, go ahead and hard-code it.
@JolJols: Now for your first question: It sounds like you understand it correctly. Each time a process in the pool finishes running the function, it'll get the next path and run the function again, and when there are no paths left for any of the processes, they'll all exit, and the main process's map function will return.
@JolJols: OK, you can't use a pool in a with statement before Python 3.3. There is a backport of the newest version of multiprocessing on PyPI somewhere, but a simpler solution is to just close the pool manually. I'll edit the answer to show how.
