CopyBackward between devices

If a tensor on the GPU requires grad but is later copied to the CPU, the CopyBackwards grad_fn records the GPU as its src_device. During the backward pass, does this result in the gradient being copied back to the GPU?

If so, is there a way to disable the second copy back to GPU?

Thanks!

Yes it will.
The reason is that a Tensor and its gradient are always on the same device.
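
A quick way to see this (a minimal sketch; depending on the PyTorch version the node prints as CopyBackwards or ToCopyBackward0):

import torch

x = torch.randn(3, device='cuda:0', requires_grad=True)
y = x.to('cpu')        # records a copy node with src_device=cuda:0
print(y.grad_fn)       # CopyBackwards (or ToCopyBackward0 on newer builds)
y.sum().backward()     # the gradient flows back through the copy node
print(x.grad.device)   # cuda:0 -- the grad lives on the same device as x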

import torch
import torch.nn as nn

class Dummy(nn.Module):
    def __init__(self, input_shape):
        super(Dummy, self).__init__()
        self.parameter = nn.Parameter(torch.randn(*input_shape), requires_grad=True)

    def forward(self, x):
        return x + self.parameter


model = Dummy((1, 2, 3)).to('cuda:0')
input = torch.randn((1, 2, 3)).to('cuda:0')
output = model(input)        # forward runs on the GPU
output = output.to('cpu')    # CopyBackwards boundary

model.to('cpu')              # move the parameters after the forward pass
golden_output = torch.randn((1, 2, 3))  # dummy target on the CPU
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

loss = loss_fn(output, golden_output)

optimizer.zero_grad()
loss.backward()
optimizer.step()

I know the above example is pretty useless, but since the module has no intermediate outputs, once the output and the parameters are moved to the CPU, wouldn't the backward pass be able to run on the CPU?

Likewise, if this were a larger module whose parameters and gradient outputs are moved to a different device, wouldn't the backward pass be able to run there?

From my understanding of the thread "Loss.backward() throws an error with multi gpus", as long as the tensors are moved to the appropriate device, the operations should be able to run.

In general, it is very hard to say whether it can or cannot work.
The basic assumption we make is that, since the backward pass is very similar to the forward pass, running the backward where the forward happened is a good idea.
This is why the backward of an op that copies a tensor from gpu -> cpu is itself a function that copies from cpu -> gpu.
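
To illustrate with a minimal sketch (the split model and the names part1/part2 are made up for this example): autograd runs each backward node on the device where the matching forward op ran, so a model split across devices gets its gradients computed per device:

import torch
import torch.nn as nn

part1 = nn.Linear(4, 4).to('cuda:0')   # forward and backward on the GPU
part2 = nn.Linear(4, 4)                # stays on the CPU

x = torch.randn(2, 4, device='cuda:0')
h = part1(x)          # runs on the GPU
h = h.to('cpu')       # copy node: its backward copies the grad cpu -> gpu
out = part2(h)        # runs on the CPU
out.sum().backward()  # each grad is computed where its forward op ran
print(part1.weight.grad.device)  # cuda:0
print(part2.weight.grad.device)  # cpu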

I see, thanks for the clarification!