
I'm currently training some neural network models and I've found that for some reason the model will sometimes fail before ~200 iterations due to a runtime error, despite there being memory available. The error is:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.76 GiB total capacity; 1.79 GiB already allocated; 3.44 MiB free; 9.76 GiB reserved in total by PyTorch)

This shows that only ~1.8 GiB of GPU memory is actually allocated, even though there should be ~9.76 GiB available.

I have found that if I happen upon a good seed (just by random search) and the model gets past the first few hundred iterations, it will generally run fine afterwards. It seems as though the model doesn't have as much memory available very early on in training, but I don't know how to solve this.

  • Try to monitor the GPU allocation while the training is running, e.g. with watch -n 0.5 nvidia-smi; you will likely see the GPU memory usage growing beyond your limit. I also recommend calling torch.cuda.reset_peak_memory_stats() before/after training (see the monitoring sketch after this comment list). If you want to dig deeper, this might be relevant: github.com/pytorch/pytorch/issues/35901 Commented Aug 3, 2021 at 11:30
  • Are you fine-tuning a model? Try reducing the number of layers you're training to see if a particular part of your architecture is causing problems. Training from scratch? Try increasing the dropout rate. I doubt these specific recommendations will solve your problem directly, but you may gain more insight into what is contributing to the growing memory footprint. Just an idea. Commented Aug 6, 2021 at 1:20
  • For me, the above error often demands that the batch-size be reduced (especially for computer-vision or other large matrices). Commented Aug 6, 2021 at 16:40
  • I don't think it's a batch size problem as it isn't really a memory issue insofar as the model trains fine after the first few iterations Commented Aug 6, 2021 at 17:09
  • Where are you running this code? locally or cloud service? Commented Aug 7, 2021 at 19:46
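
As suggested in the first comment, here is a minimal monitoring sketch using PyTorch's built-in peak-memory counters; train_one_iteration is a hypothetical placeholder for your own training step, and the iteration count and logging interval are arbitrary:

    import torch

    # Reset the peak counter so it only reflects the upcoming iterations.
    torch.cuda.reset_peak_memory_stats()

    for i in range(200):
        train_one_iteration()  # hypothetical placeholder for your training step
        if i % 10 == 0:
            peak_mib = torch.cuda.max_memory_allocated() / 1024**2
            print(f"iter {i}: peak allocated so far = {peak_mib:.1f} MiB")

Running watch -n 0.5 nvidia-smi in a separate terminal at the same time lets you compare what the driver reports with what PyTorch's caching allocator reports.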

1 Answer


It's worth noting this part of your error: 9.76 GiB reserved in total by PyTorch. That memory is held by PyTorch's caching allocator, so it is not necessarily available for new allocations. I have had a similar issue before, and I would try emptying the cache with torch.cuda.empty_cache() after deleting any tensors you no longer need (there is nothing for the cache to release until Python has dropped its references to them). Afterwards, use the nvidia-smi CLI to check what the driver reports. A common cause of maxing out memory is the batch size; I tend to use this method to calculate a reasonable batch size: https://stackoverflow.com/a/59923608/10148950
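
A minimal sketch of the batch-size idea from that linked answer, combined with cache clearing: find_max_batch_size and make_batch are hypothetical names, and the loss assumes a simple classification model, so adapt it to your setup.

    import torch
    import torch.nn.functional as F

    def find_max_batch_size(model, make_batch, start=256, device="cuda"):
        # Halve the batch size until one forward/backward pass fits in memory.
        # make_batch(n) is assumed to return (inputs, targets) on the CPU.
        batch_size = start
        while batch_size >= 1:
            try:
                inputs, targets = make_batch(batch_size)
                loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
                loss.backward()
                model.zero_grad(set_to_none=True)
                return batch_size
            except RuntimeError as e:
                if "out of memory" not in str(e):
                    raise
                batch_size //= 2
            # Only reached after an OOM: release cached blocks before retrying.
            torch.cuda.empty_cache()
        raise RuntimeError("Even a batch size of 1 does not fit on the GPU.")

Newer PyTorch releases also expose a dedicated torch.cuda.OutOfMemoryError, which is cleaner to catch than matching on the message string.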

There are also ways to use the PyTorch library itself to investigate memory usage, as per this answer: https://stackoverflow.com/a/58216793/10148950
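
For completeness, a short sketch of what that kind of in-script inspection looks like, using only standard torch.cuda calls:

    import torch

    # One-line counters, handy to log at specific points in training.
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")

    # Full caching-allocator report (active/inactive blocks, allocation
    # counts, etc.) for the current device.
    print(torch.cuda.memory_summary(abbreviated=True))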


2 Comments

torch.cuda.empty_cache() should not be used by end users.
@Ivan why do you say that?
