I have 8 GPUs, 64 CPU cores (multiprocessing.cpu_count()=64)

I am trying to run inference on multiple video files with a deep learning model. I want a subset of the files to be processed on each of the 8 GPUs, and each GPU should use a different set of 6 CPU cores.

Below is my Python script, inference_{gpu_id}.py. It takes two inputs:

Input 1: GPU_id

Input 2: the files to process for that GPU_id

import sys

import cv2
import numpy as np
from tqdm import tqdm
from torch.multiprocessing import Pool, set_start_method

try:
    set_start_method('spawn', force=True)
except RuntimeError:
    pass

gpu_id = sys.argv[1]  # Input 1: which GPU this script should use

# load_model is defined elsewhere in my codebase
model = load_model(device='cuda:' + gpu_id)

def pooling_func(file):
    # Run the model on every frame of one video and save predictions as .npy
    preds = []
    cap = cv2.VideoCapture(file)
    while cap.isOpened():
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            pred = model(frame)[0]
            preds.append(pred)
        else:
            break
    cap.release()
    np.save(file[:-4] + '.npy', preds)

def process_files():
    # Input 2: all files to process on this gpu_id
    files = np.load(gpu_id + '_files.npy')

    # I am hoping to use 6 cores for this gpu_id,
    # and a different 6 cores for a different GPU id
    pool = Pool(6)

    r = list(tqdm(pool.imap(pooling_func, files), total=len(files)))
    pool.close()
    pool.join()

if __name__ == '__main__':
    import multiprocessing
    multiprocessing.freeze_support()
    process_files()

I am hoping to run the inference_{gpu_id}.py scripts on all GPUs simultaneously.

Currently, I am able to run it successfully on one GPU with 6 cores, but when I try to run it on all GPUs together, only GPU 0 runs and all the others stop with the error message below.

RuntimeError: CUDA error: invalid device ordinal.

The commands I am running:

CUDA_VISIBLE_DEVICES=0 python inference_0.py
CUDA_VISIBLE_DEVICES=1 python inference_1.py
...
CUDA_VISIBLE_DEVICES=7 python inference_7.py
  • Have you tried os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)? Commented Jul 25, 2021 at 13:59
  • And simply use device='cuda' Commented Jul 25, 2021 at 14:00
  • @Ivan I deleted the other question as this question has more details Commented Jul 25, 2021 at 14:18
  • In the future, please avoid creating unnecessary posts. Some users may be trying to help - which was my case - and will be unable to post their answer if you decide to delete the thread. Commented Jul 25, 2021 at 14:20
  • Yes, I understand, sorry for that. Thanks for your detailed answer :) Commented Jul 25, 2021 at 14:21

1 Answer


Consider this: if you are not using the CUDA_VISIBLE_DEVICES flag, then all GPUs will be available to your PyTorch process. This means torch.cuda.device_count() will return 8 (assuming your installation is valid), and you will be able to access each of those 8 GPUs with torch.device, via torch.device('cuda:0'), torch.device('cuda:1'), ..., torch.device('cuda:7').
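
For instance, a minimal sketch to enumerate whatever devices are visible to the process:

import torch

# With no CUDA_VISIBLE_DEVICES restriction, all physical GPUs are visible
print(torch.cuda.device_count())  # -> 8 on this machine

# Each visible device is addressable by its ordinal
for i in range(torch.cuda.device_count()):
    print(torch.device(f'cuda:{i}'), torch.cuda.get_device_name(i))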

Now if you are only planning on using one device and want to restrict your process to it, then CUDA_VISIBLE_DEVICES=i (where i is the device ordinal) will make it so. In this case torch.cuda will only have access to a single device, through torch.device('cuda:0'). It doesn't matter what the actual device ordinal is; the way you access it is always through torch.device('cuda:0').
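
If you would rather set the restriction from inside Python (as suggested in the comments above), you can set the environment variable before CUDA is initialized. A minimal sketch, where gpu_id is assumed to arrive on the command line:

import os
import sys

gpu_id = sys.argv[1]  # e.g. '3'; assumed to be passed as a command-line argument

# Must be set before torch initializes CUDA, so ideally before importing torch
os.environ['CUDA_VISIBLE_DEVICES'] = gpu_id

import torch
print(torch.cuda.device_count())  # -> 1
device = torch.device('cuda:0')   # the chosen physical GPU, seen as cuda:0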

If you allow access to more than one device, let's say n°0, n°4, and n°2, then you would use CUDA_VISIBLE_DEVICES=0,4,2. Consequently, you refer to your cuda devices via d0 = torch.device('cuda:0'), d1 = torch.device('cuda:1'), and d2 = torch.device('cuda:2'), in the same order as you defined them with the flag, i.e.:

d0 -> GPU n°0, d1 -> GPU n°4, and d2 -> GPU n°2.
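
A minimal sketch of this remapping, assuming the process was launched with CUDA_VISIBLE_DEVICES=0,4,2:

import torch

# Launched with: CUDA_VISIBLE_DEVICES=0,4,2 python script.py
d0 = torch.device('cuda:0')  # physical GPU n°0
d1 = torch.device('cuda:1')  # physical GPU n°4
d2 = torch.device('cuda:2')  # physical GPU n°2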

This makes it so you can use the same code and run it on different GPUs without having to change the underlying code where you are referring to the device ordinal.

In summary, what you need to look at is the number of devices you need to run your code. In your case, 1 is enough. You will refer to it with torch.device('cuda:0'). When running your code, however, you will need to specify which physical device that cuda:0 is, with the flag:

> CUDA_VISIBLE_DEVICES=0 python inference.py
> CUDA_VISIBLE_DEVICES=1 python inference.py
  ...
> CUDA_VISIBLE_DEVICES=7 python inference.py

Do note 'cuda' will default to 'cuda:0'.
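
To start all eight processes at once, you could also drive them from a single launcher script instead of typing each command by hand. A hypothetical sketch (the names launch_all.py and a single parameterized inference.py are my assumptions, not part of your setup):

# launch_all.py -- hypothetical: one subprocess per GPU, each restricted
# to a distinct physical device via CUDA_VISIBLE_DEVICES
import os
import subprocess

procs = []
for gpu_id in range(8):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(['python', 'inference.py', str(gpu_id)], env=env))

# Wait for all GPU workers to finish
for p in procs:
    p.wait()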


5 Comments

I tried this and it is working, but it is slow and I am not sure why. If I run on one GPU it takes 3 sec/iteration, but if I run on all GPUs together, each GPU takes 30 sec/iteration
That may be related to your data loaders.
There are no data loaders now; I am just reading video files, doing some processing, and writing back to a new video file
No, I meant the data loading process - whether you actually use a DataLoader or not - takes time. If you have many trainings running at the same time it will cause an overload on the CPU, since it will have more difficulty handling all this data reading.
So, any tips on how to speed up the process?
