Copying a tensor from CUDA to CPU is too slow

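You have to add torch.cuda.synchronize() to your benchmark, because GPU operations are executed asynchronously (see here). Call it before you start the timer and again before you stop it, so that each measurement covers only the work you intend to time.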

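Your model's forward pass has probably not finished when the copy starts, so the transfer of the output has to wait for it. That waiting time is then counted as part of the copy, which makes the copy look slow.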
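A minimal sketch of what such a benchmark could look like. The model (a single `torch.nn.Linear`), the tensor sizes, and the helper name `timed_copy_to_cpu` are made up for illustration and are not from your code. The script falls back to the CPU if no GPU is available, in which case the synchronize calls are skipped.

```python
import time
import torch

def timed_copy_to_cpu(model, x):
    """Time the forward pass and the device-to-host copy separately.

    Hypothetical helper for illustration only.
    """
    use_cuda = x.is_cuda

    def sync():
        # Block until all queued GPU work has finished (no-op on CPU).
        if use_cuda:
            torch.cuda.synchronize()

    with torch.no_grad():
        sync()                      # make sure earlier work is not counted
        t0 = time.perf_counter()
        out = model(x)              # kernels are only queued here
        sync()                      # wait until the forward pass is really done
        t1 = time.perf_counter()
        out_cpu = out.cpu()         # now this measures only the transfer
        sync()
        t2 = time.perf_counter()

    return t1 - t0, t2 - t1, out_cpu

# Assumed toy setup; replace with your own model and input.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(256, 1024, device=device)

forward_s, copy_s, out_cpu = timed_copy_to_cpu(model, x)
print(f"forward: {forward_s * 1e3:.3f} ms, copy to CPU: {copy_s * 1e3:.3f} ms")
```

If you remove the `sync()` call after `model(x)`, most of the forward time should show up in the copy measurement instead, which is the effect you are seeing. The first iteration on a GPU also includes some warm-up cost, so it is worth running a few iterations before you record numbers.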
