Apologies if this question has been asked already. I searched through the forums and found posts related to my question, but they do not exactly answer my question. It is a very basic question on creating a tensor that requires gradients and then moving it to the GPU. This is my code.

import torch
device='cuda'
a = torch.randn(3, dtype=torch.float, requires_grad = True).to(device)
b = torch.randn(3, dtype=torch.float).to(device)
# Post facto set gradients
a.requires_grad_()
b.requires_grad_()
print("a is ",a)
print("b is ",b)
loss1 = a.sum()
loss2 = b.sum()
loss1.backward()
loss2.backward()
print("Gradient wrt to a is ",a.grad)
print("Gradient wrt to b is ",b.grad)

The output is

a is tensor([ 1.1719, -0.8410, -1.6699], device='cuda:0', grad_fn=<CopyBackwards>)
b is tensor([ 0.8113, 0.6762, -0.0123], device='cuda:0', requires_grad=True)
Gradient wrt to a is None
Gradient wrt to b is tensor([1., 1., 1.], device='cuda:0')

It looks like if the CPU tensor is created with “requires_grad = True”, the corresponding GPU tensor no longer shows that property but instead has a “grad_fn=<CopyBackwards>” property. If I subsequently use this tensor “a” to compute a loss, then a.grad is None after calling loss.backward(), whereas b.grad holds the gradients with respect to b. Is this the desired behavior? If so, could anyone explain why?

I am aware that I can pass the device = ‘cuda’ option directly to the torch.randn function and avoid this problem totally. But for a simple experiment, I wanted to get identical results on the cpu and gpu. The cpu and gpu use different random number generators and that is why I am following this approach. From the discussion here, I understand that the .to() call creates a new tensor. However, it does not explain the behavior that I am seeing above. Any pointers would be highly appreciated.

Thank you very much! Could you be kind enough to verify if my understanding is correct?
From the example in the link you provided -

a = torch.rand(10, requires_grad=True).double() # a is NOT a leaf variable as it was created by the operation that cast a float tensor into a double tensor
a = torch.rand(10).double().requires_grad_() # a requires gradients and has no operations creating it: it's a leaf variable and can be given to an optimizer.

So, this means autograd starts tracking the tensor only when requires_grad is set to True. Is that correct?

Autograd tracks every operation that involves a tensor requiring gradients.
However, by default the .grad attribute is populated only for leaf tensors.
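
To make the leaf / non-leaf distinction concrete, here is a minimal sketch that inspects both kinds of tensor. The names x and y are only illustrative; is_leaf and grad_fn are the standard tensor attributes.

```python
import torch

# Leaf: created directly by the user with requires_grad=True.
x = torch.rand(10, requires_grad=True)

# Non-leaf: the result of an operation (.double()) on a tensor that requires grad.
y = x.double()

print(x.is_leaf, x.grad_fn)  # leaf tensors have no grad_fn
print(y.is_leaf, y.grad_fn)  # non-leaf tensors record the op that produced them

y.mean().backward()

# The gradient is accumulated on the leaf x, not on the intermediate y.
print(x.grad)
```

The same reasoning applies to .to(device) on a tensor that already requires gradients: when it copies the data it is a tracked operation, so its result is a non-leaf.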

Since a was created by the .double() operation, it is not a leaf, so you won’t see its .grad attribute populated by default, as this dummy example shows:

a = torch.rand(10, requires_grad=True).double()
a.mean().backward()
print(a.grad)
> UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

However, the operation is still tracked and differentiable. If you call a.retain_grad(), you will be able to see the gradient on this intermediate tensor:

a = torch.rand(10, requires_grad=True).double()
a.retain_grad()
a.mean().backward()
print(a.grad)
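
For the original use case (sampling on the CPU so the values match across devices, then training on the GPU), one possible pattern is to move the tensor first and request gradients afterwards, so the tensor on the target device is itself the leaf. This is only a sketch; the CPU fallback and the seed value are assumptions added so it runs on any machine.

```python
import torch

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Sample on the CPU, move to the target device, and only then request gradients.
a = torch.randn(3, dtype=torch.float).to(device).requires_grad_()

loss = a.sum()
loss.backward()

print(a.is_leaf)  # the moved tensor is a leaf
print(a.grad)     # gradient of sum() w.r.t. a
```

This mirrors what tensor b does in the first post: requires_grad_() is called after the move, so no tracked operation sits between the leaf and the loss.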

Thank you very much. That was very helpful. And I just wanted to mention that although this is the first time I have asked a question on the forum, your answers in many other posts have helped me a lot with my research. I just wanted to express my appreciation for that!