Running Python script parallel

Question

I have a huge dataset of videos that I process using a Python script called process.py. The problem is that it takes a long time to process the whole dataset, which contains 6000 videos. So I came up with the idea of dividing the dataset into, for example, 4 parts, copying the same code into different Python scripts (e.g. process1.py, process2.py, process3.py, process4.py), and running each one in a different shell on one portion of the dataset.

My question is: would that gain me anything in terms of performance? I have a machine with 10 cores, so it would be very beneficial if I could somehow exploit this multicore structure. I have heard about Python's multiprocessing module, but unfortunately I don't know much about it, and I didn't write my script with its features in mind. Is the idea of starting each script in a different shell nonsense? Is there a way to choose which core each script uses?

Which OS? Windows or Linux, for example.

– Anthony Kong, Commented Nov 4, 2015 at 10:18
Linux (Ubuntu 14.04).

– chronosynclastic, Commented Nov 4, 2015 at 10:21

Anthony Kong · Accepted Answer · 2015-11-07 06:42:57Z

17

The multiprocessing documentation (https://docs.python.org/2/library/multiprocessing.html) is actually fairly easy to digest. This section (https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers) should be particularly relevant.

You definitely do not need multiple copies of the same script. This is an approach you can adopt:

Assume this is the general structure of your existing script (process.py).

def convert_vid(fname):
    # do the heavy lifting
    # ...
    pass  # placeholder so the function body is valid Python

if __name__ == '__main__':
    # There exists VIDEO_SET_1 to 4, as mentioned in your question
    for fname in VIDEO_SET_1:
        convert_vid(fname)

With multiprocessing, you can run the function convert_vid in separate processes. Here is the general scheme:

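from multiprocessing import Pool

def convert_vid(fname):
    # do the heavy lifting
    # ...
    pass  # placeholder so the function body is valid Python

def convert_set(video_set):
    # each worker process handles one whole subset, file by file
    for fname in video_set:
        convert_vid(fname)

if __name__ == '__main__':
    pool = Pool(processes=4)
    # one task per subset, so VIDEO_SET_1 to 4 run in four separate processes
    pool.map(convert_set, [VIDEO_SET_1, VIDEO_SET_2, VIDEO_SET_3, VIDEO_SET_4])
    pool.close()
    pool.join()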
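As a variation on the scheme above, here is a minimal sketch that hands individual files to the pool instead of pre-split subsets, so you do not have to divide the dataset by hand. It assumes Python 3 (where Pool can be used in a with block). The body of convert_vid, the helper names pick_workers and convert_all, and the file names are placeholders I made up for illustration, not part of your script.

```python
from multiprocessing import Pool, cpu_count


def convert_vid(fname):
    # Placeholder for the real per-video work (hypothetical): it only
    # returns a marker so the caller can see which files were handled.
    return fname + ".done"


def pick_workers(n_files, cores=None):
    # One worker per core, but never more workers than files
    # and never fewer than one.
    if cores is None:
        cores = cpu_count()
    return max(1, min(cores, n_files))


def convert_all(filenames):
    # Pool.map splits the list across the worker processes and
    # returns the results in the same order as the input list.
    with Pool(processes=pick_workers(len(filenames))) as pool:
        return pool.map(convert_vid, filenames)


if __name__ == '__main__':
    # Hypothetical file names standing in for the real dataset.
    videos = ["video_%04d.mp4" % i for i in range(8)]
    print(convert_all(videos))
```

With 6000 videos on a 10-core machine, pick_workers would settle on 10 processes. Whether that is the best number depends on how much of the work is CPU-bound versus disk-bound, so it is worth timing a few values.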


1 Comment

Opps_0 Over a year ago

Would you mind checking (stackoverflow.com/questions/68305077/…)? I am trying to apply the idea you gave, but I am not sure how I can pass different datasets and dataset names.
