Setting requires_grad_ on gpu

BharathC · February 6, 2020, 2:07am

Hi all,

Apologies if this question has been asked already. I searched through the forums and found posts related to my question, but they do not exactly answer my question. It is a very basic question on creating a tensor that requires gradients and then moving it to the GPU. This is my code.

import torch

device='cuda'
a = torch.randn(3, dtype=torch.float, requires_grad = True).to(device)
b = torch.randn(3, dtype=torch.float).to(device)

# Post facto set gradients

a.requires_grad_()
b.requires_grad_()
print("a is ",a)
print("b is ",b)

loss1 = a.sum()

loss2 = b.sum()

loss1.backward()

loss2.backward()

print("Gradient wrt to a is ",a.grad)
print("Gradient wrt to b is ",b.grad)

The output is

a is  tensor([ 1.1719, -0.8410, -1.6699], device='cuda:0', grad_fn=<CopyBackwards>)
b is  tensor([ 0.8113,  0.6762, -0.0123], device='cuda:0', requires_grad=True)
Gradient wrt to a is  None
Gradient wrt to b is  tensor([1., 1., 1.], device='cuda:0')

It looks like if the CPU tensor is created with “requires_grad = True”, the corresponding GPU tensor no longer shows that property but instead has the “grad_fn=<CopyBackwards>” property. If I then use this tensor “a” to compute a loss, a.grad is None after calling loss.backward(), whereas b.grad returns the gradients with respect to b. Is this the desired behavior? If so, could anyone explain why?

I am aware that I can pass the device = ‘cuda’ option directly to the torch.randn function and avoid this problem totally. But for a simple experiment, I wanted to get identical results on the cpu and gpu. The cpu and gpu use different random number generators and that is why I am following this approach. From the discussion here, I understand that the .to() call creates a new tensor. However, it does not explain the behavior that I am seeing above. Any pointers would be highly appreciated.

Thanks

ptrblck · February 6, 2020, 2:58am

Have a look at this post which explains this use case quite well.
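In case the link is not handy, here is a minimal sketch of the difference, checked via is_leaf and grad_fn. The fallback to the CPU when no GPU is available and the extra cast to double are my own assumptions, added so that .to() really creates a new tensor on any machine; they are not part of the original code.

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# requires_grad=True is set on the CPU tensor, and .to() is then a tracked
# operation, so `a` is the *result* of an op: a non-leaf tensor with a grad_fn.
a = torch.randn(3, requires_grad=True).to(device, dtype=torch.double)

# Here the copy happens first (nothing is tracked yet) and requires_grad_()
# is called on the final tensor, so `b` is a leaf tensor.
b = torch.randn(3).to(device, dtype=torch.double).requires_grad_()

print("a: is_leaf =", a.is_leaf, "| grad_fn =", a.grad_fn)
print("b: is_leaf =", b.is_leaf, "| grad_fn =", b.grad_fn)
```

Both tensors have requires_grad == True, but only b is a leaf, which is why only b.grad gets filled in after backward().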

BharathC · February 6, 2020, 5:27am

Thank you very much! Could you be kind enough to verify if my understanding is correct?
From the example in the link you provided -

a = torch.rand(10, requires_grad=True).double() # a is NOT a leaf variable as it was created by the operation that cast a float tensor into a double tensor

a = torch.rand(10).double().requires_grad_() # a requires gradients and has no operations creating it: it's a leaf variable and can be given to an optimizer.

So, this means autograd starts tracking the tensor only when requires_grad is set to True. Is that correct?

Thanks a lot

ptrblck · February 6, 2020, 6:28am

Autograd will track all operations that involve tensors which require gradients.
However, the .grad attribute will only be populated for leaf tensors by default.
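As a rough illustration of this (a sketch only; the names a_cpu and a_dev and the CPU fallback are made up for the example), the gradient is not lost when the tensor is moved. It flows back through the copy and ends up on the original leaf tensor:

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

a_cpu = torch.randn(3, requires_grad=True)    # leaf tensor, created by the user
a_dev = a_cpu.to(device, dtype=torch.double)  # non-leaf, created by a tracked op

a_dev.sum().backward()

# The copy is differentiable, so the gradient arrives at the leaf.
print(a_cpu.grad)      # tensor([1., 1., 1.])
print(a_dev.is_leaf)   # False
```

So in the original code the gradient of loss1 does exist, it is just stored on the temporary CPU tensor that was never assigned to a name.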

Since a was created by the double operation, you won’t see the .grad attribute populated by default using this dummy example:

a = torch.rand(10, requires_grad=True).double()
a.mean().backward()
print(a.grad)
> UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

However, the operation is still tracked and differentiable. If you call a.retain_grad(), you will be able to see the gradient on this intermediate tensor:

a = torch.rand(10, requires_grad=True).double()
a.retain_grad()
a.mean().backward()
print(a.grad)
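Coming back to the original goal of getting identical random values on the CPU and the GPU, one possible pattern is to sample once on the CPU, copy the values to the device, and only then call requires_grad_(). This is just a sketch: the seed value, the variable names, and the CPU fallback are arbitrary choices for the example.

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)      # arbitrary seed, only for reproducibility
values = torch.randn(3)   # sampled once with the CPU generator

# clone() gives each tensor its own storage; requires_grad_() comes last,
# so both results are leaf tensors holding the same numbers.
w_cpu = values.clone().requires_grad_()
w_dev = values.clone().to(device).requires_grad_()

w_cpu.sum().backward()
w_dev.sum().backward()

print(w_cpu.is_leaf, w_dev.is_leaf)  # True True
print(w_cpu.grad)
print(w_dev.grad)
```

Since w_dev is a leaf, it can also be passed to an optimizer directly.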

BharathC · February 7, 2020, 5:34am

Thank you very much. That was very helpful. And I just wanted to mention that although this is the first time I have asked a question on the forum, your answers in many other posts have helped me a lot with my research. I just wanted to express my appreciation for that!