
Here is a context-free excerpt of my code. It works before I add the .to(device) calls.

def get_input_layer(word_idx):
    x = torch.zeros(vocabulary_size).float().to(device)
    x[word_idx] = 1.0
    return x

embedding_dims = 5
device = torch.device("cuda:0")
W1 = Variable(torch.randn(embedding_dims, vocabulary_size).float(), requires_grad=True).to(device)
W2 = Variable(torch.randn(vocabulary_size, embedding_dims).float(), requires_grad=True).to(device)
num_epochs = 100
learning_rate = 0.001

x = Variable(get_input_layer(data)).float().to(device)
y_true = Variable(torch.from_numpy(np.array([target])).long()).to(device)
z1 = torch.matmul(W1, x).to(device)
z2 = torch.matmul(W2, z1).to(device)

log_softmax = F.log_softmax(z2, dim=0).to(device)

loss = F.nll_loss(log_softmax.view(1,-1), y_true).to(device)
loss_val += loss.data
loss.backward().to(device)

## Optimize values. This is done by hand rather than using the optimizer function
W1.data -= learning_rate * W1.grad.data
W2.data -= learning_rate * W2.grad.data

I get

Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'data'

Which triggers specifically on the line

W1.data -= learning_rate * W1.grad.data

Checking confirms it: W1.grad is indeed None, for some reason.

This runs inside a loop, and I clear the gradients after each iteration. Everything works fine if I remove all of the .to(device) calls. What am I doing wrong in trying to run this on my GPU?

Thank you for your time.


1 Answer


This happens because .to returns a new, non-leaf tensor. You should set requires_grad after transferring the tensor to the desired device. Also, the Variable interface has been deprecated for a long time, since before pytorch 1.0; it doesn't do anything (except, in this case, acting as an overly complicated way to set requires_grad).

Consider

W1 = Variable(torch.randn(embedding_dims, vocabulary_size).float(), requires_grad=True).to(device)

The problem here is that there are two different tensors. Breaking it down, we could rewrite what you're doing as

W1a = Variable(torch.randn(embedding_dims, vocabulary_size).float(), requires_grad=True)
W1 = W1a.to(device)

Observe that W1a requires a gradient, but W1 is derived from W1a, so it isn't considered a leaf tensor. Therefore the .grad attribute of W1a will be updated, but W1's won't be. In your code you no longer have a direct reference to W1a, so you won't have access to its gradients.
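This is easy to reproduce on the CPU alone: casting to a different dtype forces .to to make a copy, just as .to(device) does when the device changes, so the same leaf/non-leaf split appears (names a and b are mine, purely for illustration):

```python
import torch

# Minimal CPU-only sketch of the non-leaf problem. The dtype change forces
# .to to return a copy, exactly like a device transfer would.
a = torch.randn(3, requires_grad=True)  # a is a leaf; gradients accumulate here
b = a.to(torch.float64)                 # b is a copy of a, hence NOT a leaf

b.sum().backward()

print(a.is_leaf, b.is_leaf)  # True False
print(b.grad is None)        # True -- the source of the AttributeError
```

Accessing b.grad even emits a UserWarning telling you the tensor is not a leaf, which is another clue when debugging this.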

Instead you can do

W1 = torch.randn(embedding_dims, vocabulary_size).float().to(device)
W1.requires_grad_(True)

which will properly set W1 to be a leaf tensor after it has been transferred to the target device.


Note that for your specific case we could also just make use of the device, dtype, and requires_grad arguments for torch.randn and simply do

W1 = torch.randn(embedding_dims, vocabulary_size, dtype=torch.float, device=device, requires_grad=True)

Most pytorch functions which initialize new tensors support these three arguments which can help avoid the issues you've encountered.
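As a quick sanity check that both patterns really do produce leaf tensors with populated gradients, here is a runnable CPU sketch (the dtype cast in the first pattern forces .to to copy, mimicking a device transfer; the sizes and device are placeholders):

```python
import torch

device = torch.device("cpu")  # stand-in for "cuda:0" so this runs anywhere

# Pattern 1: transfer (here a dtype cast, which copies) first, then flag grad
W1 = torch.randn(5, 10, dtype=torch.float64).to(torch.float32)
W1.requires_grad_(True)

# Pattern 2: construct directly with the desired dtype/device/requires_grad
W2 = torch.randn(5, 10, dtype=torch.float, device=device, requires_grad=True)

(W1.sum() + W2.sum()).backward()
print(W1.is_leaf, W2.is_leaf)            # True True
print(W1.grad is None, W2.grad is None)  # False False
```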


To respond to OP's additional question in the comments:

Is there a good spot I would've come across this in the documentation?

AFAIK the documentation doesn't specifically address this issue. It's a combination of how variables work in python and how autograd mechanics work in pytorch.

Assuming you have a good understanding of variables in python, you can reach the conclusions of this answer yourself by first reading the documentation for Tensor.is_leaf, in particular

they will be leaf Tensors if they were created by the user. This means that they are not the result of an operation and so grad_fn is None.

And furthermore the documentation for Tensor.to which states

If the self Tensor already has the correct torch.dtype and torch.device, then self is returned. Otherwise, the returned tensor is a copy of self with the desired torch.dtype and torch.device.

Since Tensor.to returns a copy, and a copy is an operation, then it should be clear from the documentation that the W1 tensor in the original code is not a leaf tensor.
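You can observe both halves of that quote directly (a tiny CPU-only sketch; the tensor t is hypothetical):

```python
import torch

t = torch.randn(2)  # float32 on the CPU by default

# Same dtype and device: .to is a no-op and returns self, the same object
print(t.to(torch.float32) is t)  # True

# Different dtype (or device): .to returns a copy, i.e. a new tensor
print(t.to(torch.float64) is t)  # False
```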


2 Comments

Thank you! Is there a good spot I would've come across this in the documentation? I looked, but I don't think I found anything this clear. Perhaps I wasn't looking in the right place. Appreciate the answer.
@Jibril I added an additional discussion addressing your question.
