PyTorch GPU memory leak during inference

Question

I am trying to encode documents sentence by sentence with a Hugging Face transformer model. I'm using the very small google/bert_uncased_L-2_H-128_A-2 pretrained model with the following code:

def pre_encode_wikipedia(model, tokenizer, device, save_path):
  
  document_data_list = []

  for iteration, document in enumerate(wikipedia_small['text']):
    torch.cuda.empty_cache()

    sentence_embeds_per_doc = [torch.randn(128)]
    attention_mask_per_doc = [1]
    special_tokens_per_doc = [1]

    doc_split = nltk.sent_tokenize(document)
    doc_tokenized = tokenizer.batch_encode_plus(doc_split, padding='longest', truncation=True, max_length=512, return_tensors='pt')

    for key, value in doc_tokenized.items():
      doc_tokenized[key] = doc_tokenized[key].to(device)

    with torch.no_grad():  
      doc_encoded = model(**doc_tokenized)

    for sentence in doc_encoded['last_hidden_state']:
      sentence[0].to('cpu')
      sentence_embeds_per_doc.append(sentence[0])
      attention_mask_per_doc.append(1)
      special_tokens_per_doc.append(0)

    sentence_embeds = torch.stack(sentence_embeds_per_doc)
    attention_mask = torch.FloatTensor(attention_mask_per_doc)
    special_tokens_mask = torch.FloatTensor(special_tokens_per_doc)

    document_data = torch.utils.data.TensorDataset(*[sentence_embeds, attention_mask, special_tokens_mask])
    torch.save(document_data, f'{save_path}{time.strftime("%Y%m%d-%H%M%S")}{iteration}.pt')
    print(f"Document at {iteration} encoded and saved.")

After about 200-300 iterations on my local GTX 1060 3GB I get an error saying that my CUDA memory is full. Running this code on Colab with more GPU RAM gives me a few thousand iterations.

Things I've tried:

- Adding torch.cuda.empty_cache() to the start of every iteration to clear out previously held tensors
- Wrapping the model call in torch.no_grad() to disable the computation graph
- Setting model.eval() to disable any stochastic properties that might take up memory
- Sending the output straight to the CPU in the hope of freeing up memory
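As a minimal sketch of what two of these steps actually do, the snippet below uses a small stand-in module (an assumption for illustration, not the BERT model from the question). It only checks that model.eval() switches off training mode and that outputs computed under torch.no_grad() carry no autograd graph. Neither step releases tensors that Python code still holds a reference to, and torch.cuda.empty_cache() likewise only returns cached blocks that no live tensor is using.

```python
import torch

# Stand-in for the real model (assumption: any nn.Module behaves the same way here).
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.Dropout(p=0.5))
model.eval()  # dropout becomes a no-op; tensor ownership is unaffected

inputs = torch.randn(3, 8)
with torch.no_grad():
    outputs = model(inputs)  # no computation graph is recorded

print(model.training)         # False
print(outputs.requires_grad)  # False
print(outputs.grad_fn)        # None
```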

I'm baffled as to why my memory keeps overflowing. I've trained several bigger models, applying all the standard practices of a training loop (optimizer.zero_grad(), etc.), and I've never had this problem. Why does it appear during this seemingly trivial task?

Edit #1: Changing sentence[0].to('cpu') to cpu_sentence = sentence[0].to('cpu') gave me a few thousand iterations before VRAM usage suddenly spiked, causing the run to crash.

I don't think sentence[0].to('cpu') will move your tensor to 'cpu'; it will make a copy. Could you check?
– Ivan, Commented Jan 26, 2021 at 18:58
Do you get this error also on CUDA after the few 1000 iterations?
– cronoik, Commented Jan 27, 2021 at 13:17
Yes, same error. I'm assuming it's just because the Colab GPUs have larger VRAM and it takes more iterations to fill up.
– Marco Moldovan, Commented Jan 27, 2021 at 15:37

Alex Bravo · Accepted Answer · 2021-01-26 23:28:14Z


Can you try replacing

sentence[0].to('cpu')

with

cpu_sentence = sentence[0].to('cpu')

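Tensor.to returns a new tensor rather than modifying the original in place, so the result has to be assigned to a name and that name used afterwards. See the documentation for more information: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to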
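The return-value behaviour can be checked without a GPU. The sketch below uses a dtype conversion as a CPU-runnable stand-in for the device move (an assumption made so the example runs anywhere; device moves follow the same rule of returning the result). The helper name collect_cls_embeddings and the random input are hypothetical and only illustrate how the loop in the question could keep the returned tensors.

```python
import torch

x = torch.zeros(3, dtype=torch.float32)

# Calling .to() and discarding the result leaves x untouched.
x.to(torch.float64)
print(x.dtype)  # torch.float32

# Binding the result gives the converted tensor.
y = x.to(torch.float64)
print(y.dtype)  # torch.float64


def collect_cls_embeddings(last_hidden_state):
    """Hypothetical helper: first-token vector of each sequence, on the CPU."""
    cpu_rows = []
    for sentence in last_hidden_state:
        # Keep the tensor that .to() returns and append that one.
        cpu_sentence = sentence[0].detach().to('cpu')
        cpu_rows.append(cpu_sentence)
    return torch.stack(cpu_rows)


# Random stand-in for model output with shape (sentences, tokens, hidden).
fake_hidden = torch.randn(4, 6, 128)
embeddings = collect_cls_embeddings(fake_hidden)
print(embeddings.shape)  # torch.Size([4, 128])
```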


6 Comments

Marco Moldovan Over a year ago

This seemed to work at first: VRAM utilization stayed reasonably low for a few thousand iterations, about an order of magnitude more than I would usually get, so something definitely worked. But then

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 3.00 GiB total capacity; 1.95 GiB already allocated; 0 bytes free; 1.98 GiB reserved in total by PyTorch)

reappeared. I'm posting a picture of the VRAM spike in the description.

Alex Bravo Over a year ago

Did you change it like this?

cpu_sentence = sentence[0].to('cpu')
sentence_embeds_per_doc.append(cpu_sentence)

Marco Moldovan Over a year ago

I did, yes. Got any other idea what I could try?

Alex Bravo Over a year ago

I think you should look into what allocates this much memory: 112.00 MiB
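One way to start that investigation is plain arithmetic on tensor sizes. The sketch below is only a rough estimate under stated assumptions: the helper float32_tensor_mib is hypothetical, the hidden size 128 and max_length 512 come from the question, and the sentence counts are made-up examples. It shows that a single float32 activation for a document with several hundred sentences, all padded to 512 tokens, would be on the order of the failed allocation. The real 112.00 MiB request may well be a different intermediate tensor.

```python
def float32_tensor_mib(*shape):
    """Size in MiB of a dense float32 tensor with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * 4 / 2**20  # 4 bytes per float32


hidden_size = 128  # H-128 in the model name
max_length = 512   # truncation limit used in the question

# Hypothetical short document: 20 sentences, longest one 40 tokens.
print(float32_tensor_mib(20, 40, hidden_size))           # 0.390625

# Hypothetical long document: 448 sentences padded to the full 512 tokens.
print(float32_tensor_mib(448, max_length, hidden_size))  # 112.0
```

Because the whole document is encoded as one batch with padding='longest', the batch grows with both the number of sentences and the longest sentence. If an unusually long document turns out to be the cause, encoding doc_split in fixed-size chunks would bound the batch. Logging torch.cuda.max_memory_allocated() next to the iteration index is one way to confirm which document triggers the spike.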

Maxim Bravo Over a year ago

When you import the pretrained model you can do the following:

model = ???.from_pretrained("google/bert_uncased_L-2_H-128_A-2")
model.to("cpu")
