Tensor-to-CPU operation is a bottleneck in the forward pass

I am using scipy's linear_sum_assignment after my forward pass, before computing the loss (as done in the DETR - object detection with transformers - GitHub codebase).
To use the scipy function, the tensor needs to be transferred to the CPU. That operation takes a significant amount of time during training (as expected). Is there a recommended practice for transferring a tensor to the CPU that would optimise the overall time taken?
In DETR's usage of cpu() (above), they put the torch.no_grad() decorator on the function containing the cpu() call. I tried doing the same, but did not observe any improvement in time consumption.
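For context, this is roughly the pattern I am following (a minimal sketch; `cost` stands in for the matching cost matrix my model produces on the GPU, and the function name is just illustrative):

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def match(cost):
    # cost: [num_queries, num_targets] matching cost computed on the GPU.
    # scipy works on numpy arrays, so the tensor has to be moved to the CPU;
    # this device-to-host copy is the step that shows up as a bottleneck.
    row_ind, col_ind = linear_sum_assignment(cost.cpu().numpy())
    return torch.as_tensor(row_ind), torch.as_tensor(col_ind)
```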

Hi Pranjal!

I think that this is the best you can do. To use scipy, as you recognize,
your tensor must be on the cpu and it takes unavoidable time to move
it there.

Nonetheless, some comments:

You could hypothetically implement “linear_sum_assignment” yourself
on the gpu. (This would almost certainly not be worth the considerable
effort.) Even if the gpu were not particularly well adapted to that algorithm,
it probably wouldn’t be dramatically slower than the cpu, and you would
save the non-trivial time of moving tensors back and forth.

You could hypothetically have your gpu performing some other useful
computation internally while it’s moving the tensor to the cpu. But guessing
that your workflow looks like forward pass, linear_sum_assignment, loss
computation, backward pass, I don’t see what other useful work would
be available for your gpu during the gpu / cpu data transfer.
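For what it's worth, here is a minimal sketch of what such overlap could look like, assuming you did have independent gpu work (here just a placeholder matmul) to run while the cost matrix is copied into a pinned cpu buffer on a separate cuda stream:

```python
import torch

copy_stream = torch.cuda.Stream()

cost = torch.randn(100, 100, device="cuda")           # matching cost matrix
other_input = torch.randn(4096, 4096, device="cuda")  # placeholder for independent work

# Make sure the copy stream sees the finished cost tensor.
copy_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(copy_stream):
    # A device-to-host copy only overlaps with compute when the destination
    # is pinned (page-locked) host memory.
    cost_cpu = torch.empty(cost.shape, dtype=cost.dtype, pin_memory=True)
    cost_cpu.copy_(cost, non_blocking=True)

# Work issued on the default stream can execute while the copy is in flight.
other_result = other_input @ other_input

# Wait for the copy to finish before handing cost_cpu to scipy.
copy_stream.synchronize()
```

But, as said above, in a forward pass / matching / loss / backward pass workflow there is typically no such independent work available to hide the transfer behind.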

If your model (and / or input data and batch size) is very small, so that you
aren’t really getting significant benefit from the gpu, running your entire
training loop on the cpu could be faster. But this would seem unlikely for
any realistic use case.

Good luck.

K. Frank


Hello Pranjal,
I totally agree with @KFrank; usually it is not worth recoding a function like that in PyTorch. But if you want to, take a look at LSAP. If you want to see a C or C++ "unhidden" implementation, I would search for "Jonker-Volgenant".
Best,
nwn


Also, you can check this for a GPU + CPU implementation of LAP using the auction algorithm,

More LAP implementations can be found at:

and