Copying a tensor from CUDA to CPU is too slow

CUDA operations are executed asynchronously, so you need to synchronize the code via torch.cuda.synchronize() before starting and before stopping the timer. In your example, the cpu() call forces a synchronization, so the printed time most likely covers the model execution plus the data transfer, not the copy alone.
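Here is a minimal sketch of how the two timings could be separated. The model (a single torch.nn.Linear layer), the tensor sizes, and the helper names sync and time_forward_and_copy are placeholders I made up for illustration, not taken from your code, and the script falls back to the CPU if no GPU is available:

```python
import time
import torch

def sync(device):
    # Block until all queued CUDA kernels on this device finish; no-op on CPU.
    if device.type == "cuda":
        torch.cuda.synchronize(device)

def time_forward_and_copy(model, x):
    # Hypothetical helper: returns (forward seconds, copy seconds, CPU output).
    device = x.device
    sync(device)                      # make sure earlier work is not counted
    t0 = time.perf_counter()
    with torch.no_grad():
        out = model(x)
    sync(device)                      # wait for the forward pass to really finish
    t1 = time.perf_counter()
    out_cpu = out.cpu()               # now this measures only the transfer
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, out_cpu

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model
x = torch.randn(256, 1024, device=device)        # placeholder input

time_forward_and_copy(model, x)  # warm-up run, excluded from the measurement
forward_s, copy_s, out_cpu = time_forward_and_copy(model, x)
print(f"forward: {forward_s * 1e3:.3f} ms, copy to CPU: {copy_s * 1e3:.3f} ms")
```

Without the second sync call, the time of the forward pass would be attributed to out.cpu(), which is what makes the copy look slow.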


It helps a lot, thanks.