
I'm currently training some neural network models and I've found that for some reason the model will sometimes fail before ~200 iterations due to a runtime error, despite there being memory available. The error is:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.76 GiB total capacity; 1.79 GiB already allocated; 3.44 MiB free; 9.76 GiB reserved in total by PyTorch)

This shows that only ~1.8 GiB of GPU memory is actually allocated, even though there should be ~9.76 GiB available.

I have found that if I happen upon a good seed (just by random search) and the model gets past the first few hundred iterations, it will generally run fine afterwards. It seems as though the model doesn't have as much memory available very early on in training, but I don't know how to solve this.

  • Try to monitor the GPU allocation while the training is running, e.g. with watch -n 0.5 nvidia-smi; you will likely see the GPU memory usage growing beyond your limit. I also recommend calling torch.cuda.reset_peak_memory_stats() before/after training (see the monitoring sketch after this comment list). If you want to dig deeper, this might be relevant: github.com/pytorch/pytorch/issues/35901 Commented Aug 3, 2021 at 11:30
  • Are you fine-tuning a model? Try reducing the number of layers you're training to see if a particular part of your architecture is causing problems. Training from scratch? Try increasing the dropout rate. I doubt these specific recommendations will solve your problem directly, but you may gain more insight into what is contributing to the growing memory footprint. Just an idea. Commented Aug 6, 2021 at 1:20
  • For me, the above error often demands that the batch-size be reduced (especially for computer-vision or other large matrices). Commented Aug 6, 2021 at 16:40
  • I don't think it's a batch size problem as it isn't really a memory issue insofar as the model trains fine after the first few iterations Commented Aug 6, 2021 at 17:09
  • Where are you running this code? locally or cloud service? Commented Aug 7, 2021 at 19:46
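
As suggested in the first comment, here is a minimal monitoring sketch using PyTorch's built-in peak-memory counters; train_one_iteration is a hypothetical placeholder for your own training step, and the iteration count and logging interval are arbitrary:

    import torch

    # Reset the peak counter so it only reflects the upcoming iterations.
    torch.cuda.reset_peak_memory_stats()

    for i in range(200):
        train_one_iteration()  # hypothetical placeholder for your training step
        if i % 10 == 0:
            peak_mib = torch.cuda.max_memory_allocated() / 1024**2
            print(f"iter {i}: peak allocated so far = {peak_mib:.1f} MiB")

Running watch -n 0.5 nvidia-smi in a separate terminal at the same time lets you compare what the driver reports with what PyTorch's caching allocator reports.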

1 Answer


It's worth noting this part of your error: 9.76 GiB reserved in total by PyTorch. That memory is held by PyTorch's caching allocator, so it is not necessarily available for new allocations. I have had a similar issue before, and I would try emptying the cache with torch.cuda.empty_cache() after deleting any tensors you no longer need (there is nothing for the cache to release until Python has dropped its references to them). Afterwards, use the nvidia-smi CLI to check what the driver reports. A common cause of maxing out memory is the batch size; I tend to use this method to calculate a reasonable batch size: https://stackoverflow.com/a/59923608/10148950
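
A minimal sketch of the batch-size idea from that linked answer, combined with cache clearing: find_max_batch_size and make_batch are hypothetical names, and the loss assumes a simple classification model, so adapt it to your setup.

    import torch
    import torch.nn.functional as F

    def find_max_batch_size(model, make_batch, start=256, device="cuda"):
        # Halve the batch size until one forward/backward pass fits in memory.
        # make_batch(n) is assumed to return (inputs, targets) on the CPU.
        batch_size = start
        while batch_size >= 1:
            try:
                inputs, targets = make_batch(batch_size)
                loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
                loss.backward()
                model.zero_grad(set_to_none=True)
                return batch_size
            except RuntimeError as e:
                if "out of memory" not in str(e):
                    raise
                batch_size //= 2
            # Only reached after an OOM: release cached blocks before retrying.
            torch.cuda.empty_cache()
        raise RuntimeError("Even a batch size of 1 does not fit on the GPU.")

Newer PyTorch releases also expose a dedicated torch.cuda.OutOfMemoryError, which is cleaner to catch than matching on the message string.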

There are also ways to use the PyTorch library itself to investigate memory usage, as per this answer: https://stackoverflow.com/a/58216793/10148950
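
For completeness, a short sketch of what that kind of in-script inspection looks like, using only standard torch.cuda calls:

    import torch

    # One-line counters, handy to log at specific points in training.
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")

    # Full caching-allocator report (active/inactive blocks, allocation
    # counts, etc.) for the current device.
    print(torch.cuda.memory_summary(abbreviated=True))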


2 Comments

torch.cuda.empty_cache() should not be used by end users.
@Ivan why do you say that?
