Tensor.item() takes a lot of running time

Hi,

The point above is that item() looks like it's taking a lot of time because it forces a synchronization with your GPU.
But the item call itself is not what takes the time; it's the rest of the operations that are still running on the GPU, which item() has to wait for.
That's what you see when using CUDA_LAUNCH_BLOCKING=1: you force each operation to be synchronous, so nothing is left to be done when you call item and it returns quickly.
The behaviour of .data[0] that you see is because it delays the sync even more (it then happens somewhere in the dataloader).
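
If you want to see this for yourself, here is a minimal sketch that times the same work twice, once with an explicit torch.cuda.synchronize() before item() and once without. The helper name timed_item and the sizes are just made up for illustration, and it falls back to CPU (where everything is synchronous anyway) if no GPU is available.

```python
import time
import torch

def timed_item(sync_first: bool, n: int = 512, iters: int = 10):
    """Hypothetical helper: return (launch seconds, item() seconds, value)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(n, n, device=device)

    t0 = time.perf_counter()
    y = x
    for _ in range(iters):
        # On a GPU these calls only queue work and return right away.
        y = torch.tanh(y @ x / n)
    if sync_first and device == "cuda":
        # Wait here, so item() below has nothing left to wait for.
        torch.cuda.synchronize()
    t1 = time.perf_counter()

    # Copies a single scalar to the host; blocks until the queued work is done.
    value = y.sum().item()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, value

if __name__ == "__main__":
    for sync_first in (False, True):
        launch, item, _ = timed_item(sync_first)
        print(f"sync_first={sync_first}: launch={launch:.4f}s item={item:.4f}s")
```

On a GPU you should expect the item() time to shrink when sync_first=True and the time before it to grow by about the same amount, while the sum of the two stays roughly constant.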

As you would expect, the total runtime is always the same: these calls don't take any real time themselves, they just change where and how the CUDA sync happens.
Note that if you want to profile the runtime of the CUDA ops, you want to set CUDA_LAUNCH_BLOCKING=1 so that you measure the runtime of each operation, not the sync points at the end.
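
A rough sketch of what that could look like from inside a script is below. As far as I know the variable has to be set before CUDA is initialized, so setting it before importing torch is the safe option (exporting it in the shell before launching works too). The time_op helper is a hypothetical name, not a PyTorch function.

```python
import os

# Assumption: this must happen before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import time
import torch

def time_op(fn, *args):
    """Hypothetical helper: run fn(*args) and return (result, elapsed seconds).

    With CUDA_LAUNCH_BLOCKING=1 each kernel launch waits for the kernel to
    finish, so a plain host timer around the call measures the op itself.
    """
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(1024, 1024, device=device)
    _, secs = time_op(torch.mm, a, a)
    print(f"mm on {device}: {secs:.4f}s")
```

Keep this for profiling only, since blocking launches remove the overlap between CPU and GPU work that you normally get.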
