62

I am training PyTorch deep learning models in a JupyterLab notebook, using CUDA on a Tesla K80 GPU. During training iterations, all 12 GB of GPU memory are used. I finish training by saving the model checkpoint, but I want to keep using the notebook for further analysis (inspecting intermediate results, etc.).

However, these 12 GB remain occupied (as seen in nvtop) after training finishes. I would like to free this memory so that I can use it for other notebooks.

My solution so far has been to restart the notebook's kernel, but that doesn't solve the problem, because then I can't keep using the same notebook and the outputs it has computed so far.

9 Answers

42

The answers so far are correct for the CUDA side of things, but there is also an issue on the IPython side.

When an error occurs in a notebook environment, the IPython shell stores the traceback of the exception so you can inspect the error state with %debug. The catch is that this requires every variable involved in the error to be held in memory, and those references are not reclaimed by methods like gc.collect(). Effectively, all your variables stay pinned and the memory leaks.

Usually, raising a new exception frees the state of the old one. So trying something like 1/0 may help. However, things can get weird with CUDA variables, and sometimes there is no way to clear your GPU memory without restarting the kernel.
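You can also clear the stored traceback's frames directly instead of raising a dummy exception. A minimal stdlib-only sketch (the helper name release_exception_state is mine; in a notebook the stored traceback lives in sys.last_traceback):

```python
import gc
import sys
import traceback

def release_exception_state():
    """Drop the frame locals pinned by the last stored traceback, then run
    garbage collection so those objects (e.g. CUDA tensors) can be freed."""
    tb = getattr(sys, "last_traceback", None)
    if tb is not None:
        traceback.clear_frames(tb)  # clears each frame's local variables
    gc.collect()
```

traceback.clear_frames() silently skips frames that are still executing, so it is safe to call from the top level of a notebook cell.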

For more detail see these references:

https://github.com/ipython/ipython/pull/11572

How to save traceback / sys.exc_info() values in a variable?


3 Comments

"However things can get weird with Cuda variables and sometimes there's no way to clear your GPU memory without restarting the kernel" Wow are you serious? That's really bad...
The 1/0 trick worked like a charm for me!
traceback.clear_frames(sys.last_traceback) might work. Jupyter sucks.
36
with torch.no_grad():
    torch.cuda.empty_cache()

4 Comments

For me it only ever worked inside with torch.no_grad():
This post should be marked as the right answer! Worked for me.
Curious about why this may work?
Wow, this actually works!
32

If you set an object that is using a lot of memory to None, like this:

obj = None

and after that call

gc.collect() # Python thing

you may avoid having to restart the notebook.


If you would also like to see the memory freed in nvidia-smi or nvtop, you can run:

torch.cuda.empty_cache() # PyTorch thing

to empty the PyTorch cache.
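Putting the two steps together; this helper and its namespace argument are my own sketch (in a notebook you would simply assign obj = None at the top level), and the torch calls are guarded so the function degrades gracefully when PyTorch or CUDA is absent:

```python
import gc

def free_refs(namespace, *names):
    """Hypothetical helper: drop named references, collect Python garbage,
    then release PyTorch's cached CUDA blocks if PyTorch is available."""
    for name in names:
        namespace[name] = None          # equivalent to `obj = None`
    gc.collect()                        # Python thing
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()    # PyTorch thing: nvtop now shows it freed
    except ImportError:
        pass                            # no torch, nothing GPU-side to release
```

For example, free_refs(globals(), "model", "optimizer") clears both references in one call.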

5 Comments

I tried model = None and gc.collect() but it didn't clear any GPU memory
I usually use nvtop for checking GPU memory. Is that a good way to do it?
gc.collect() tells Python to do garbage collection. If you check with NVIDIA tools you won't see the memory cleared, because PyTorch still holds it in its allocated cache, but it does become available for PyTorch to reuse.
Yeah, torch.cuda.empty_cache() may help you see it clear.
It worked for me, in this order: 1. model = None, 2. gc.collect(), 3. torch.cuda.empty_cache()
5

Apparently you can't clear the GPU memory via a command once the data has been sent to the device. There is a reference for this in the PyTorch GitHub issues, but the following works for me.

Context: I have PyTorch running in JupyterLab in a Docker container with access to two GPUs [0,1]. Two notebooks are running: the first is on a long job, while I use the second for small tests. When I started doing this, repeated tests seemed to progressively fill the GPU memory until it maxed out. I tried all the suggestions (del, clearing the GPU cache, etc.); nothing worked until the following.

To clear the second GPU I first installed Numba (pip install numba) and then ran the following code:

from numba import cuda
 
cuda.select_device(1) # choosing second GPU 
cuda.close()

Note that I don't actually use Numba for anything except clearing the GPU memory. I selected the second GPU because my first is being used by another notebook; put in the index of whichever GPU you need. Finally, while this doesn't kill the Jupyter kernel, it does destroy the CUDA context, so you can't use it intermittently during a run to free up memory.

2 Comments

I have a similar issue in Jupyter and your solution actually releases the memory, but after closing the device I cannot access it anymore: RuntimeError: CUDA error: invalid argument
@ArayKarjauv those are hard to diagnose. I used this nice step-by-step advice here for some problems I once had but I've never had what you're describing.
1

If you have a variable called model, you can try to free the memory it occupies on the GPU (assuming it lives there) by first dropping the reference with del model and then calling torch.cuda.empty_cache().

Comments

1

This sequence fully solved it for me:

  1. move the model to the CPU: model = model.cpu()
  2. del model
  3. with torch.no_grad(): torch.cuda.empty_cache()
  4. import gc
  5. gc.collect()

I cleared the OOM error by following these steps, without restarting the kernel.

Comments

0

I've never worked with PyTorch myself, but Google has several results which all basically say the same thing: torch.cuda.empty_cache()

https://forums.fast.ai/t/clearing-gpu-memory-pytorch/14637

https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530

How to clear Cuda memory in PyTorch

1 Comment

torch.cuda.empty_cache() cleared most of the used memory, but I still have 2.7 GB in use. It might be the memory occupied by the model, but I don't know how to clear it. I tried model = None and gc.collect() from the other answer and it didn't work.
0

If I remember correctly, this helped me:

If I delete the model, I can reassign the GPU memory.

# model_1 training
del model_1
# model_2 training works

If I try to keep the model, the deep copy retains the references to GPU memory, and I cannot reuse the assigned GPU memory.

import copy
# model_1 training
model_1_save = copy.deepcopy(model_1)
del model_1
# model_2 training memory error

If I want to use the first model later and train a second model on the GPU:

# model_1 training
model_1.to("cpu")
# model_2 training works
model_2.to("cpu")
model_1.to("cuda")
# model_1 continuing training works

Comments

0

It works for me: call torch.cuda.empty_cache() at the end of the training process. I've written a function that frees the GPU RAM, but note that the rest of the function's content is irrelevant to clearing the memory; the empty_cache() call is what matters.

def empty_cuda_mem(model, data_loader, loss_fn):
    with torch.no_grad():
        # one incidental forward pass; only the empty_cache() call frees memory
        x_batch, y_batch = next(iter(data_loader))  # draw a batch from the given loader
        yp = model(x_batch.to(device))
        loss = loss_fn(yp, y_batch.to(device))
        torch.cuda.empty_cache()  # release PyTorch's cached CUDA blocks

Comments
