No gradient on cuda?

It seems I can’t get a gradient when sending tensors to cuda:

import torch
print(torch.__version__)

x = torch.tensor(4.2, requires_grad=True).cuda()
y = torch.tensor(5.2, requires_grad=True).cuda()
output = x * y
output.backward()

print(output)
print(x.grad)
print(y.grad)

and the output is:

1.10.1
tensor(21.8400, device='cuda:0', grad_fn=<MulBackward0>)
None
None
C:\Users\**\anaconda3\lib\site-packages\torch\_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at  aten\src\ATen/core/TensorBody.h:417.)
  return self._grad

Obviously the tensors x and y are leaf tensors, so how come their gradients are not calculated? Am I missing something? Thanks for any advice!

That’s not the case: cuda() is a differentiable operation, so calling it returns a new non-leaf tensor, as seen here:

import torch

x = torch.tensor(4.2, requires_grad=True)
print(x.is_leaf)
# > True

x = torch.tensor(4.2, requires_grad=True).cuda()
print(x.is_leaf)
# > False
print(x.grad_fn)
# > <ToCopyBackward0 object at 0x7f2aa29adcd0>

If you want to create a trainable tensor on the GPU, either use the device argument in its initialization or call detach().requires_grad_() on it.
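For completeness, a minimal sketch of both approaches (assuming a CUDA device is available; the values are just the ones from your snippet):

import torch

# Option 1: create the leaf tensor directly on the GPU via the device argument
x = torch.tensor(4.2, requires_grad=True, device='cuda')
print(x.is_leaf)
# > True

# Option 2: move to the GPU first, then detach from the copy op and
# re-enable gradients, which turns the GPU tensor into a new leaf
y = torch.tensor(5.2, requires_grad=True).cuda().detach().requires_grad_()
print(y.is_leaf)
# > True

output = x * y
output.backward()
print(x.grad, y.grad)
# > tensor(5.2000, device='cuda:0') tensor(4.2000, device='cuda:0')

Note that with the second approach the original CPU tensor is no longer part of the graph; only the detached GPU leaf receives a gradient.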

Thanks for your reply, that’s interesting to know! May I ask how it is that cuda() is differentiable? I thought it was just used to send data from the CPU to the GPU?

It should be differentiable in the same way that nn.Identity is differentiable; it introduces another tensor into the computation graph even though it does not apply a transformation to the input to produce its output.
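As a small illustration (assuming a CUDA device is available), the copy op simply routes the gradient back to the CPU leaf, just as an identity would:

import torch

x_cpu = torch.tensor(4.2, requires_grad=True)  # leaf tensor on the CPU
x_gpu = x_cpu.cuda()                           # non-leaf copy with grad_fn=<ToCopyBackward0>

out = x_gpu * 3.0
out.backward()
print(x_cpu.grad)
# > tensor(3.)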

In addition to what @eqy said: it allows you to use different devices without detaching the computation graph. E.g. you could run some operations on the CPU, push the data to the GPU, perform more operations there, etc., as seen here:

import torch
import torch.nn as nn

# setup
x = torch.randn(1, 1, requires_grad=True)
lin_on_cpu = nn.Linear(1, 1)
lin_on_gpu = nn.Linear(1, 1).cuda()

# forward
out_on_cpu = lin_on_cpu(x)
out_on_gpu = out_on_cpu.to('cuda')
out_on_gpu = lin_on_gpu(out_on_gpu)

# backward
out_on_gpu.mean().backward()
print(x.grad) # gives a valid gradient
# > tensor([[0.0911]])

This can be used e.g. for model sharding, i.e. splitting a model across multiple GPUs.
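A rough sketch of that last point (model sharding, assuming at least two GPUs are available):

import torch
import torch.nn as nn

x = torch.randn(1, 1, requires_grad=True)
lin_on_gpu0 = nn.Linear(1, 1).to('cuda:0')
lin_on_gpu1 = nn.Linear(1, 1).to('cuda:1')

# forward: move the activation between devices; the graph stays connected
out = lin_on_gpu1(lin_on_gpu0(x.to('cuda:0')).to('cuda:1'))

# backward: the gradient flows back through both devices to the CPU leaf
out.mean().backward()
print(x.grad)  # gives a valid gradient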