Tensor.grad is None when using a CUDA device

I’m trying to compute derivatives of functions in a TensorFlow-like fashion. Consider the following code:

import torch
x = torch.linspace(-10., 10, 10000, requires_grad=True)
y = x**2
y.backward(torch.ones_like(x))
g = x.grad

It works just fine, with g holding the gradient of f(x) = x**2, i.e. f'(x) = 2x.
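
For reference, a quick sanity check (a sketch reusing the same snippet) confirms that g matches the analytical derivative:

import torch

x = torch.linspace(-10., 10, 10000, requires_grad=True)
y = x**2
y.backward(torch.ones_like(x))
print(torch.allclose(x.grad, 2 * x.detach()))  # True: autograd's gradient equals 2x
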
But if I move the tensor to my GPU with x = x.cuda(), it does not work, and I get this warning:

<stdin>:1: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at C:\cb\pytorch_1000000000000\work\build\aten\src\ATen/core/TensorBody.h:491.)

And then g is None. Why is that? How can I fix it?

The .cuda() / .to() operation is differentiable and will create a non-leaf tensor, as the warning explains.
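
A minimal sketch to illustrate (assuming a CUDA-capable machine):

import torch

x = torch.linspace(-10., 10, 10000, requires_grad=True)
print(x.is_leaf)   # True: created directly by the user
x = x.cuda()       # the copy to the GPU is recorded by autograd
print(x.is_leaf)   # False: x is now the output of an autograd operation
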
Either specify the device during creation:

x = torch.linspace(-10., 10, 10000, device="cuda", requires_grad=True)
y = x**2
y.backward(torch.ones_like(x))
g = x.grad

or create a new leaf tensor:

x = torch.linspace(-10., 10, 10000, requires_grad=True)
x = x.to("cuda")                  # x is now a non-leaf tensor on the GPU
x.detach_().requires_grad_(True)  # detach it in-place so it becomes a leaf again, then re-enable grad
y = x**2
y.backward(torch.ones_like(x))
g = x.grad
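
For completeness, the warning also points at .retain_grad(), which tells autograd to populate .grad on the non-leaf tensor itself; a sketch (assuming a CUDA device is available):

import torch

x = torch.linspace(-10., 10, 10000, requires_grad=True)
x = x.cuda()      # x is now a non-leaf tensor on the GPU
x.retain_grad()   # keep the gradient for this non-leaf tensor
y = x**2
y.backward(torch.ones_like(x))
g = x.grad        # populated on the GPU tensor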

It works now, thanks!
It still seems very counter-intuitive to me though: why would the Tensor.to function be differentiable? It's simply copying a block of memory; no operation is being performed. Or am I thinking about this the wrong way?

Because it allows you to use different devices in the forward and backward passes, which is a fantastic feature: it lets you e.g. offload to the CPU, shard a model across devices, or increase the numerical precision of some ops by casting to float64, etc.
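
As a concrete sketch of that (assuming a CUDA device), the gradient flows back through the .to() calls to the original CPU float32 leaf:

import torch

# leaf tensor lives on the CPU in float32
x = torch.linspace(-10., 10, 10000, requires_grad=True)

# forward pass runs on the GPU in float64; both casts are recorded by autograd
y = (x.to("cuda").to(torch.float64) ** 2).sum()

y.backward()
print(x.grad.device, x.grad.dtype)  # cpu torch.float32: the gradient came back across device and dtype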