Forward/backward partially on CPU and partially on GPU?

Is it possible to run the forward pass of a model on the GPU but calculate the loss of the last layer on the CPU?
If so, how does PyTorch know during backprop where each tensor is? Or does it expect all tensors to lie consistently on one device?

If it is possible, is there a documentation article or other resource that explains this process?

Background: I calculate a loss with torch.pca_lowrank, which is significantly faster on the CPU in my use case (more than 8x).

Cheers!

Yes, you can move activations around, as Autograd will track the to()/cpu()/cuda() operations.
E.g.:

# setup
import torch
import torch.nn as nn
x_cuda = torch.randn(1, 1).to('cuda')
lin_cuda = nn.Linear(1, 1).to('cuda')
lin_cpu = nn.Linear(1, 1)

# workload on the GPU
out = lin_cuda(x_cuda)
# transfer to CPU
out_cpu = out.to('cpu')
# workload on CPU
out_cpu = lin_cpu(out_cpu)

# loss calculation on the CPU
loss = (out_cpu ** 2).mean()

# backward
loss.backward()

# check grads
for name, param in lin_cuda.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.2126]], device='cuda:0')
  bias tensor([-0.4568], device='cuda:0')

for name, param in lin_cpu.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.7824]])
  bias tensor([0.8254])

As you can see, the gradients are created on the device where each parameter lives, so this approach will work.
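For the original torch.pca_lowrank use case the same pattern applies; here is a minimal sketch, where the shapes, q, and the singular-value loss term are just placeholder assumptions:

import torch
import torch.nn as nn

model = nn.Linear(64, 32).to('cuda')
x = torch.randn(128, 64, device='cuda')

# forward on the GPU
features = model(x)

# move the activation to the CPU, where pca_lowrank is faster in this use case
features_cpu = features.to('cpu')
U, S, V = torch.pca_lowrank(features_cpu, q=8)

# toy loss on the singular values, computed on the CPU
loss = S.sum()
loss.backward()

# the gradients land on the GPU parameters as before
print(model.weight.grad.device)  # cuda:0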


@ptrblck, does it make any difference, in terms of performance and memory, whether the loss terms are brought to the same device as the model?

For example, in an RL setup the log-probabilities in the experience buffer are normally stored on the CPU, while the model's output log-probabilities are on the GPU, and both participate in the loss computation. Is it better to move the former to the GPU or the latter to the CPU?

Assuming you want to keep these two outputs on different devices (and won't be able to move everything to the GPU), you would need to compare the data transfer time against the loss calculation.
If the loss calculation is "expensive" you might want to move the CPU tensor to the GPU before computing the loss. You can also transfer the CPU tensor asynchronously (via pinned memory and non_blocking=True, as sketched below) and could hide the transfer time if other operations can be executed before the loss calculation is needed.
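A rough sketch of the asynchronous transfer (the tensor names and sizes are made up):

import torch

# pinned (page-locked) memory enables asynchronous host-to-device copies
buffer_logprob = torch.randn(1024, pin_memory=True)

# start the copy; non_blocking=True lets it overlap with subsequent GPU work
buffer_logprob_gpu = buffer_logprob.to('cuda', non_blocking=True)

# ... other GPU work (e.g. the model forward) can execute while the copy is in flight ...

# kernels on the same stream that consume buffer_logprob_gpu (e.g. the loss)
# are ordered after the copy, so no manual synchronization is needed here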

However, if the loss computation is really cheap and the CUDA tensor is small, you could transfer it to the CPU instead, but you should definitely profile both options first.
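A quick way to compare both directions with torch.utils.benchmark (the shapes are arbitrary and the squared-error loss is just a placeholder):

import torch
from torch.utils import benchmark

model_logprob = torch.randn(256, device='cuda')
buffer_logprob = torch.randn(256)

# move the CPU tensor to the GPU and compute the loss there
t_gpu = benchmark.Timer(
    stmt="((buffer_logprob.to('cuda') - model_logprob) ** 2).mean()",
    globals={'buffer_logprob': buffer_logprob, 'model_logprob': model_logprob},
)

# move the CUDA tensor to the CPU and compute the loss there
t_cpu = benchmark.Timer(
    stmt="((buffer_logprob - model_logprob.to('cpu')) ** 2).mean()",
    globals={'buffer_logprob': buffer_logprob, 'model_logprob': model_logprob},
)

print(t_gpu.timeit(100))
print(t_cpu.timeit(100))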