To fully utilize the CPU and GPU, I run several processes that each do DNN inference (feed forward) on a separate dataset. Because every process allocates CUDA memory during the feed forward, I get CUDA out-of-memory errors. Adding a torch.cuda.empty_cache() call made things better, but out-of-memory errors still occur occasionally, probably because of unlucky timing between allocations and releases across the processes.
I managed to solve the problem by wrapping the feed-forward call in a multiprocessing.BoundedSemaphore, but this makes it difficult to initialize the semaphore and share it between the processes.
Is there a better way to avoid this kind of error when running multiple GPU inference processes?