Speed benchmark on VGG16

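You need to call torch.cuda.synchronize() before reading the end time (and ideally before the start time as well). CUDA kernels are launched asynchronously: the Python call returns as soon as the kernel is queued, not when it finishes, so without a manual sync the timer can stop while the GPU is still working.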
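As a rough sketch of the synchronized timing described above: the helper name `timed_forward`, the iteration count, and the tiny stand-in network are assumptions made for illustration (the stand-in keeps the example self-contained; in practice you would substitute your VGG16 model). The sync calls are skipped when the input is not on a CUDA device.

```python
import time

import torch
import torch.nn as nn


def timed_forward(model, batch, n_iters=10):
    """Return the average wall-clock seconds per forward pass.

    Hypothetical helper: synchronizes before starting and before stopping
    the timer so that all queued CUDA kernels are included in the result.
    """
    use_cuda = batch.is_cuda
    with torch.no_grad():
        model(batch)  # warm-up pass, excluded from the timing
        if use_cuda:
            torch.cuda.synchronize()  # make sure warm-up work has finished
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the timed kernels to finish
        end = time.perf_counter()
    return (end - start) / n_iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small stand-in network (assumption); replace with your VGG16 instance.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device).eval()

batch = torch.randn(4, 3, 32, 32, device=device)
avg_seconds = timed_forward(model, batch)
print(f"average forward time: {avg_seconds * 1000:.3f} ms on {device}")
```

The warm-up pass is there because the first call on a GPU often includes one-time setup costs that would otherwise skew the average.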

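Also, your timing includes the host-to-device CUDA copies. If those are meant to be part of the benchmark, using pinned (page-locked) host memory with asynchronous transfers would be faster than a plain blocking copy from ordinary pageable memory.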
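A minimal sketch of the pinned-memory transfer mentioned above, assuming the input starts as an ordinary CPU tensor. The helper name `to_device_async` is hypothetical; `Tensor.pin_memory()` and the `non_blocking=True` argument are the standard PyTorch calls. On a machine without CUDA the helper falls back to a plain copy.

```python
import torch


def to_device_async(cpu_tensor, device):
    """Copy a CPU tensor to `device`, using pinned memory when on CUDA.

    Hypothetical helper: with a pinned source, non_blocking=True lets the
    copy be issued asynchronously with respect to the host.
    """
    if device.type == "cuda":
        pinned = cpu_tensor.pin_memory()  # page-locked host copy
        return pinned.to(device, non_blocking=True)
    return cpu_tensor.to(device)  # no CUDA: ordinary copy


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
host_batch = torch.ones(2, 3)
device_batch = to_device_async(host_batch, device)

if device.type == "cuda":
    # The copy may still be in flight; sync before timing or inspecting it.
    torch.cuda.synchronize()

print(device_batch.device, tuple(device_batch.shape))
```

If the data comes from a `torch.utils.data.DataLoader`, passing `pin_memory=True` to the loader gives you pinned batches without calling `pin_memory()` yourself.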
