I have a vector a, produced by a neural model, that needs to interact with a huge matrix M. Since M is large, I have to do the computation on the CPU. In this case, I wonder whether the gradient can be retained and backpropagated to the CUDA device.

Below is an example. I am looking for a solution such that a_cuda.grad has the same gradients as a_cpu.grad.

import torch

a_cuda = torch.randn([1, 512], requires_grad=True).to("cuda")
a_cpu = torch.randn([1, 512], requires_grad=True).to("cpu")
M = torch.randn([512, 100000], requires_grad=False) # kept on the CPU, does not need updating
out_cuda = (a_cuda.cpu() @ M).sum()
out_cuda.backward()
out_cpu = (a_cpu @ M).sum()
out_cpu.backward()
print(a_cuda.grad) # None
print(a_cpu.grad)

UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

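a_cuda is not a leaf tensor: it is the output of the .to("cuda") call, not the tensor that torch.randn created with requires_grad=True. That is why a_cuda.grad is None.
Fix it by creating a_cuda on the GPU directly: a_cuda = torch.randn([1, 512], requires_grad=True, device="cuda") and it'll work.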
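
A minimal sketch of that fix, for illustration only. It assumes a smaller M (1000 columns instead of 100000) to keep it quick, and it falls back to the CPU when no GPU is available, so the `device` variable is an addition that is not in the original snippet:

```python
import torch

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Created directly on the target device, so this tensor is a leaf.
a = torch.randn([1, 512], requires_grad=True, device=device)

# M stays on the CPU and does not need gradients (smaller than in the question).
M = torch.randn([512, 1000], requires_grad=False)

# .cpu() is recorded by autograd, so the gradient flows back to `a`
# on its original device.
out = (a.cpu() @ M).sum()
out.backward()

print(a.is_leaf)       # leaf tensors get .grad populated
print(a.grad.device)   # same device as `a`
```

The gradient of the sum with respect to a is the row sums of M, which is what a.grad should hold here.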

Thanks for the response. Since a_cuda is an output tensor from a model (i.e. a_cuda=model(input)), I cannot find a way to create it on the GPU directly. Is there any other workaround? Thanks.
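
One possible workaround, following the hint in the warning itself, is to call .retain_grad() on the model output before running backward. The sketch below is only an illustration: `nn.Linear(16, 512)` is a hypothetical stand-in for the real model, M is shrunk to 1000 columns, and the `device` fallback is an assumption so that it also runs without a GPU.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the real model; any module producing [1, 512] works.
model = nn.Linear(16, 512).to(device)
x = torch.randn([1, 16], device=device)

# M stays on the CPU and does not need gradients.
M = torch.randn([512, 1000], requires_grad=False)

a = model(x)       # non-leaf tensor: it is the result of an operation
a.retain_grad()    # ask autograd to keep a.grad after backward()

out = (a.cpu() @ M).sum()
out.backward()

print(a.grad.shape, a.grad.device)     # gradient lives on the device of `a`
print(model.weight.grad is not None)   # parameters receive gradients too
```

Note that .retain_grad() is only needed to inspect a.grad. The model parameters receive their gradients through the .cpu() call either way, so training works without it.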