Forward/Backward partially on cpu and partially on gpu?

Is it possible to run a model's forward pass on the GPU but calculate the loss of the last layer on the CPU?
If so, how does PyTorch know during backprop which device each tensor is on? Or does it expect all tensors to lie consistently on one device?

If it is possible, is there a documentation article or other resource which explains this process?

Background: I calculate a loss with torch.pca_lowrank, which is significantly faster on the CPU in my use case (more than 8x).

Cheers!

Yes, you can move activations between devices, as Autograd will track the to()/cpu()/cuda() operations.
E.g.:

import torch
import torch.nn as nn

# setup
x_cuda = torch.randn(1, 1).to('cuda')
lin_cuda = nn.Linear(1, 1).to('cuda')
lin_cpu = nn.Linear(1, 1)

# workload on the GPU
out = lin_cuda(x_cuda)
# transfer to CPU
out_cpu = out.to('cpu')
# workload on CPU
out_cpu = lin_cpu(out_cpu)

# loss calculation on the CPU
loss = (out_cpu ** 2).mean()

# backward
loss.backward()

# check grads
for name, param in lin_cuda.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.2126]], device='cuda:0')
  bias tensor([-0.4568], device='cuda:0')

for name, param in lin_cpu.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.7824]])
  bias tensor([0.8254])

As the gradients on both devices show, the backward pass works across the CPU/GPU boundary: the to() transfer is recorded in the autograd graph, so during backward the gradient is automatically moved back to the original device and the parameters of both lin_cuda and lin_cpu receive gradients.
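
Applied to the torch.pca_lowrank use case from the question, a minimal sketch could look like the following. The model, the shapes, the value of q, and the loss built from the singular values are placeholders, assuming your pca_lowrank-based criterion is differentiable:

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# hypothetical model and shapes, just for illustration
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32)).to(device)
x = torch.randn(8, 16, device=device)

# forward pass on the GPU
features = model(x)

# transfer to the CPU; Autograd records the to()/cpu() op
features_cpu = features.cpu()

# PCA on the CPU (the faster device in this use case)
U, S, V = torch.pca_lowrank(features_cpu, q=4)

# placeholder loss built from the singular values; substitute your actual criterion
loss = S.sum()

# backward: gradients flow from the CPU loss back to the GPU parameters
loss.backward()

print(next(model.parameters()).grad.device)  # cuda:0 if a GPU is available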
