16

I have a huge dataset of videos that I process using a python script called process.py. The problem is it takes a lot of time to process all the dataset which contains 6000 videos. So, I came up with the idea of dividing this dataset for example into 4 and copy the same code to different Python scripts (e.g. process1.py, process2.py, process3.py, process3.py) and run each one on different shells with one portion of the dataset.

My question is would that bring me anything in terms of performance? I have a machine with 10 cores so it would be very beneficial if I could somehow exploit this multicore structure. I heard about multiprocessing module of Python but unfortunately, I don't know much about it and I didn't write my script considering that I would use its features. Is the idea of starting each script in different shells nonsense? Is there a way to choose which core would be used by each script?

2
  • Which OS? Windows or Linux, for example. Commented Nov 4, 2015 at 10:18
  • Linux (Ubuntu 14.04). Commented Nov 4, 2015 at 10:21

1 Answer 1

17

The multiprocessing documentation ( https://docs.python.org/2/library/multiprocessing.html) is actually fairly easy to digest. This section (https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers) should be particularly relevant

You definitely do not need multiple copy of the same script. This is an approach you can adopt:

Assume it is the general structure of your existing script (process.py).

def convert_vid(fname):
    # do the heavy lifting
    # ...

if __name__ == '__main__':
   # There exists VIDEO_SET_1 to 4, as mentioned in your question
   for file in VIDEO_SET_1:  
       convert_vid(file)

With multiprocessing, you can fire the function convert_vid in seperate processes. Here is the general scheme:

from multiprocessing import Pool

def convert_vid(fname):
    # do the heavy lifting
    # ...

if __name__ == '__main__':
   pool = Pool(processes=4) 
   pool.map(convert_vid, [VIDEO_SET_1, VIDEO_SET_2, VIDEO_SET_3, VIDEO_SET_4]) 
Sign up to request clarification or add additional context in comments.

1 Comment

would mind checking (stackoverflow.com/questions/68305077/…). I am trying to apply the idea you gave but I am not sure how I can pass different datasets and dataset name

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.