I have a Python script running two models on CUDA. One model is larger than the other, but their inference times are of the same order of magnitude (~3 ms vs. ~9 ms). The issue arises when I move the output tensors from the GPU back to the CPU: the output tensor of the smaller model takes around 15 ms to transfer, while the output tensor of the larger model takes ~500 ms, even though the larger model's output is actually smaller than the smaller model's.
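Here is a simplified sketch of how I'm timing things (the models below are just placeholder stand-ins for my actual networks; I time each forward pass and each `.cpu()` call separately with `time.perf_counter()`):

```python
import time
import torch
import torch.nn as nn

# Placeholder stand-ins for the two models described above
# (my real models are larger, but the timing pattern is the same).
small_model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda().eval()
large_model = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(256, 256, 3, padding=1)).cuda().eval()

x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # "Inference time" of each model
    t0 = time.perf_counter()
    out_small = small_model(x)
    t1 = time.perf_counter()
    out_large = large_model(x)
    t2 = time.perf_counter()

    # GPU -> CPU transfer time of each output
    t3 = time.perf_counter()
    small_cpu = out_small.cpu()
    t4 = time.perf_counter()
    large_cpu = out_large.cpu()
    t5 = time.perf_counter()

print(f"small forward: {(t1 - t0) * 1e3:.2f} ms")
print(f"large forward: {(t2 - t1) * 1e3:.2f} ms")
print(f"small .cpu() : {(t4 - t3) * 1e3:.2f} ms")
print(f"large .cpu() : {(t5 - t4) * 1e3:.2f} ms")
```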
I find this behavior puzzling. Does anyone have an explanation?