PyTorch inference: CUDA out of memory when multiprocessing

Question

To fully utilize the CPU and GPU, I run several processes that each perform DNN inference (a forward pass) on a separate dataset. Because every process allocates CUDA memory during the forward pass, I get CUDA out-of-memory errors. Adding a torch.cuda.empty_cache() call made things better, but the errors still occur occasionally, probably because allocations and releases in the different processes are badly timed.

I managed to solve the problem by wrapping the feed-forward call in a multiprocessing.BoundedSemaphore, but this makes it difficult to initialize the semaphore and share it between the processes.

Is there a better way to avoid this kind of error when running multiple GPU inference processes?

THN · Accepted Answer · 2021-08-23 10:18:54Z

In my experience with parallel training and inference, it is almost impossible to squeeze out the last bit of GPU memory. Probably the best you can do is estimate the maximum number of processes that can run in parallel, then restrict your code to run at most that many at the same time. A semaphore is the typical way to limit the number of parallel processes and to let a new one start automatically as soon as a slot opens up.
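
As a rough illustration of that estimate, here is a minimal sketch. The function name, the 10% headroom, and the example sizes are assumptions for illustration, not measured values. The per-process peak would have to be measured on your own model, and it should include the memory each process needs for its own CUDA context.

```python
def max_parallel_processes(total_bytes, per_process_peak_bytes, headroom=0.1):
    """Estimate how many inference processes fit on one GPU.

    total_bytes            -- GPU memory available to your jobs
    per_process_peak_bytes -- measured peak usage of ONE process
                              (model + activations + CUDA context)
    headroom               -- fraction kept free as a safety margin (assumed 10%)
    """
    usable = total_bytes * (1 - headroom)
    # Always allow at least one process, even if the estimate says zero.
    return max(1, int(usable // per_process_peak_bytes))


GIB = 2 ** 30
# Hypothetical numbers: a 16 GiB card and a 3 GiB peak per process.
n_process = max_parallel_processes(16 * GIB, 3 * GIB)
```

The result is only a starting point. Fragmentation in the caching allocator and input-dependent activation sizes can push real usage above the measured peak, so it may be worth lowering the number if errors persist.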

To make it easier to initialize the semaphore and share it between processes, you can use a multiprocessing.Pool with a pool initializer, as follows.

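import multiprocessing as mp

def pool_init(semaphore):
    # Runs once in every worker process and stores the shared semaphore in a global.
    global pool_semaphore
    pool_semaphore = semaphore

if __name__ == "__main__":
    n_process = 4  # example value: the estimated maximum number of parallel processes
    semaphore = mp.BoundedSemaphore(n_process)
    with mp.Pool(n_process, initializer=pool_init, initargs=(semaphore,)) as pool:
        # here, each process can access the shared variable pool_semaphore
        pass  # submit work to the pool here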
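
To show how a worker might then use the shared semaphore, here is a minimal sketch. It assumes that only the forward pass needs to be guarded; `run_inference`, `worker`, `main`, and the slot counts are hypothetical names and values, and `run_inference` is a CPU stand-in for the real model call so the example runs without a GPU.

```python
import multiprocessing as mp

pool_semaphore = None  # set in every worker process by pool_init


def pool_init(semaphore):
    # Runs once per worker; keeps a reference to the shared semaphore.
    global pool_semaphore
    pool_semaphore = semaphore


def run_inference(batch):
    # Hypothetical stand-in for the real forward pass on the GPU.
    return [x * 2 for x in batch]


def worker(batch):
    # CPU-side work (loading, preprocessing) can happen outside the semaphore.
    with pool_semaphore:  # blocks until a GPU slot is free, releases on exit
        return run_inference(batch)


def main(n_workers=4, n_gpu_slots=2):
    # Fewer GPU slots than workers: extra workers wait instead of allocating.
    semaphore = mp.BoundedSemaphore(n_gpu_slots)
    batches = [[i, i + 1] for i in range(8)]
    with mp.Pool(n_workers, initializer=pool_init, initargs=(semaphore,)) as pool:
        return pool.map(worker, batches)


if __name__ == "__main__":
    print(main())
```

Giving the semaphore fewer slots than the pool has workers is a design choice in this sketch: CPU-bound steps can still run in all workers, while only `n_gpu_slots` of them hold GPU memory for a forward pass at any moment.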

The alternative, greedy approach is to wrap the GPU call in a try ... except block inside a while loop and keep retrying until it succeeds. However, this can add significant performance overhead, so it is probably not a good idea.
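
For completeness, a minimal sketch of that greedy approach follows. It is a bounded variant with a back-off delay rather than an endless while loop. It assumes that a CUDA out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory"; check this against your PyTorch version. `run_with_retry`, `is_oom`, and the `on_oom` hook are hypothetical names introduced here.

```python
import time


def is_oom(exc):
    # Assumption: OOM is reported as a RuntimeError mentioning "out of memory".
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_retry(fn, *args, max_retries=5, base_delay=0.5, on_oom=None):
    """Call fn(*args), retrying with exponential back-off on OOM errors."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except RuntimeError as exc:
            # Re-raise anything that is not OOM, and give up on the last attempt.
            if not is_oom(exc) or attempt == max_retries - 1:
                raise
            if on_oom is not None:
                on_oom()  # e.g. pass torch.cuda.empty_cache here
            time.sleep(base_delay * 2 ** attempt)
```

A call could look like `run_with_retry(forward, batch, on_oom=torch.cuda.empty_cache)`, where `forward` is your own inference function. Every failed attempt wastes the work done before the error, which is the overhead mentioned above.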
