Significant difference in GPU-to-CPU tensor transfer time between two models

I have a Python script running two models on CUDA. One model is larger than the other, but their measured inference times are in the same ballpark (~3 ms vs. ~9 ms). The issue arises when I move the output tensors from the GPU back to the CPU: the output tensor of the smaller model takes around 15 ms to transfer, while the output tensor of the larger model takes ~500 ms, even though the larger model's output tensor is actually smaller than that of the smaller model.
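For context, the measurement pattern is roughly the following (a simplified, self-contained sketch; the stand-in model, shapes, and variable names are placeholders for my actual setup):

```python
import time
import torch

# Stand-in model and input; the real models and shapes differ.
model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(x)          # "inference time" measured around this call
    t1 = time.perf_counter()

    out_cpu = out.cpu()     # "transfer time" measured around this call
    t2 = time.perf_counter()

print(f"inference: {(t1 - t0) * 1e3:.1f} ms, transfer: {(t2 - t1) * 1e3:.1f} ms")
```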

I find this behavior puzzling. Does anyone have an explanation?

CUDA operations are executed asynchronously: launching a kernel returns control to the host immediately, while the GPU keeps working in the background. Moving the output tensor to the CPU adds a synchronization point, so the host thread has to wait until all previously scheduled CUDA kernels have finished before the copy can run. Your host timer therefore accumulates the still-running inference kernels into the "transfer" measurement unless you explicitly synchronize (e.g. via torch.cuda.synchronize()) before starting and stopping the host timers.
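A minimal sketch of how to separate the two timings, assuming host-side timers (the stand-in model and input are placeholders; adapt it to your models):

```python
import time
import torch

# Placeholder model and input; replace with your own.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

with torch.no_grad():
    # Warm-up iterations so one-time setup costs don't skew the numbers.
    for _ in range(10):
        model(x)

    # Time the forward pass only: synchronize before and after, otherwise
    # the host timer stops before the kernels have actually finished.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(x)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    print(f"forward:  {(t1 - t0) * 1e3:.2f} ms")

    # Time the device-to-host copy in isolation. Thanks to the synchronize
    # above, no pending kernels get attributed to the transfer anymore.
    t0 = time.perf_counter()
    out_cpu = out.cpu()
    t1 = time.perf_counter()
    print(f"D2H copy: {(t1 - t0) * 1e3:.2f} ms")
```

Alternatively, CUDA events (torch.cuda.Event(enable_timing=True), record() around the work, then torch.cuda.synchronize() and elapsed_time()) measure the GPU-side duration directly instead of relying on host timers.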