
This question is related to my other question.

I tried running multiple machine-learning training processes in parallel (launched from bash). The programs are written in PyTorch. After a certain number of concurrent processes (10 in my case), I get the following error:

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

As mentioned in this answer,

...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).

For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.

I tried the solution mentioned here to enforce a per-process GPU memory limit, but the issue persists.

This problem does not occur with a single process, or with a smaller number of processes. Since only one context runs at any given instant, why does this cause a memory issue?

The issue occurs both with and without MPS. I expected it to occur with MPS but not otherwise, since MPS may run multiple processes in parallel.

  • Yeah, if you ask for too much memory, a computer may crash. This is not GPU specific, you can also try to allocate a 10000000GB array in your CPU and make your code crash. What is your question? Commented Nov 30, 2022 at 17:48
  • @AnderBiguri As stated, the problem doesn't occur with a single process of the same nature, but with 10 processes running concurrently. Why does this occur, since the GPU runs only 1 process at a time? Commented Nov 30, 2022 at 17:49
  • The GPU is a device purposely designed and built for parallel processing. Why do you think it only does one thing at a time? It will compute one thing at a time only when that computation is bigger than its processing power, but that's it. Many processes can run on the GPU simultaneously; this is absolutely OK and expected (e.g. you may be running your display and compute at any time). Check nvidia-smi to see all your different processes running at the same time on the GPU. Commented Nov 30, 2022 at 17:50
  • @AnderBiguri By simultaneously, do you mean in parallel? I understand why display and compute appear to be happening in parallel, but they are happening sequentially. Commented Nov 30, 2022 at 17:55
  • When the GPU is executing multiple processes (one after the other, for example by pre-emption), is the memory being utilized by multiple processes at the (exact) same time? Even by those that the GPU is not executing at the moment? Commented Nov 30, 2022 at 17:57

1 Answer


Since only one context runs at a single time instant, why does this cause memory issue?

Context-switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context-switching doesn't alleviate that.

If you run multiple processes, the memory used by each process will add up (just like it does in the CPU space) and GPU context switching (or MPS or time-slicing) does not alleviate that in any way.
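A minimal sketch of this point, with hypothetical numbers (the per-process footprint and VRAM capacity below are assumptions for illustration, not measured values): each process's allocation stays resident in device memory for its whole lifetime, so aggregate demand grows linearly with the process count no matter how the GPU time-slices between contexts.

```python
# Hypothetical figures: per-process GPU memory demand adds up across
# concurrent processes, regardless of how the GPU context-switches.

GPU_VRAM_GB = 12.0       # assumed total device memory
PER_PROCESS_GB = 1.5     # assumed resident footprint of one training job

def total_demand(n_processes, per_process_gb=PER_PROCESS_GB):
    """Aggregate VRAM demand of n concurrent processes (GB)."""
    return n_processes * per_process_gb

def max_concurrent(vram_gb=GPU_VRAM_GB, per_process_gb=PER_PROCESS_GB):
    """Largest process count whose combined demand still fits."""
    return int(vram_gb // per_process_gb)

print(total_demand(10))   # 15.0 GB requested by 10 jobs...
print(max_concurrent())   # ...but only 8 jobs of 1.5 GB fit in 12 GB
```

Under these assumed numbers, the 9th and 10th jobs push aggregate demand past capacity, and the allocation failure surfaces in whichever process happens to request memory next, which is why the error message (a cuDNN algorithm-selection failure) looks unrelated to VRAM exhaustion.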

It's completely expected that if you run enough processes using the GPU, eventually you will run out of resources. Neither GPU context switching nor MPS nor time-slicing in any way affects the memory utilization per process.


Comments

As usual, Robert has been able to convey with better words what I meant in the comments ;). Thanks.
Thank you. That answers the issue. Are you aware of any solutions to limit this usage (PyTorch or TF specific)? The ones I mentioned in the question don't appear to work.
@abs Use less memory? Buy a bigger GPU? make sure you read the available GPU specs, and schedule accordingly?
@AnderBiguri Of course those are possible. I specifically asked solutions to limit the usage.
There are many questions here on SO that are PyTorch- or TF-specific and ask how to deal with GPU out-of-memory situations. I don't have any secrets to share beyond those. As a practical matter, my expectation is that well before you discovered how to go from running 10 training jobs at the same time to running 100 training jobs at the same time on the same GPU, you would run into other performance limits that would make the benefits of adding more jobs disappear.
