Speed benchmark on VGG16

tdeboissiere · February 12, 2017, 5:45am

I am testing pytorch’s speed on a simple VGG16 benchmark and I have noticed the following timings:

Iteration: 0 train on batch time: 414.968 ms. (warm up)
Iteration: 1 train on batch time: 274.113 ms. (much faster than warm up and subsequent iterations)
Iteration: 2 train on batch time: 377.266 ms. (from now on, pretty much constant times)
Iteration: 3 train on batch time: 386.689 ms.
Iteration: 4 train on batch time: 385.500 ms.
Iteration: 5 train on batch time: 385.082 ms.
Iteration: 6 train on batch time: 385.090 ms.

Do these timings sound reasonable, and is there a reason why iteration 1 is much faster than the rest?

System specs:

CUDA 8
CUDNN 5.1
Maxwell Titan X
pytorch installed with conda (0.1.8-py27_1cu80)

Thanks!

apaszke · February 12, 2017, 1:29pm

You need to add torch.cuda.synchronize() before measuring the end time. CUDA kernels are launched asynchronously, so the Python code keeps running (and can reach your timer) before the GPU has finished the work. To measure the time correctly, you need to sync manually.
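
A minimal sketch of that timing pattern, not the original benchmark code. The helper names `gpu_sync` and `time_step` are hypothetical, and the `try`/`except` guard is only an assumption so the snippet also runs on a machine without PyTorch or a GPU:

```python
import time

try:
    import torch
except ImportError:  # assumption: let the sketch run without PyTorch installed
    torch = None


def gpu_sync():
    """Wait for all queued CUDA work to finish; no-op without a usable GPU."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def time_step(step_fn, n_iters=5, n_warmup=1):
    """Return per-iteration wall-clock times in ms for step_fn (hypothetical helper)."""
    timings_ms = []
    for i in range(n_warmup + n_iters):
        gpu_sync()                      # drain earlier work before starting the clock
        start = time.perf_counter()
        step_fn()                       # e.g. forward + backward + optimizer step
        gpu_sync()                      # wait for the kernels launched by step_fn
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= n_warmup:               # discard warm-up iterations
            timings_ms.append(elapsed_ms)
    return timings_ms
```

Calling `time_step(lambda: train_on_batch(x, y))`, with `train_on_batch` standing in for your own training step, would then report times that include the GPU work rather than just the kernel launches.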

Also, you’re measuring the time of CUDA copies, but using pinned memory and async transfers would be faster than the simplest approach.

tdeboissiere · February 13, 2017, 12:20am

Thanks! Can you point me to the documentation I would need to re-implement the benchmark with pinned memory and async transfers?

apaszke · February 13, 2017, 1:00am

You can find the docs for pin_memory() here. For async copies just call .cuda(async=True) (but remember it is a no-op on non-pinned memory).
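
A rough sketch of what that could look like. The helpers `cuda_available` and `to_gpu_async` are hypothetical names, and the guard around the import is an assumption so the snippet runs without PyTorch or a GPU. Note that in later PyTorch releases the `async` argument was renamed to `non_blocking` (since `async` became a reserved word in Python), so the sketch uses the newer spelling:

```python
try:
    import torch
except ImportError:  # assumption: let the sketch run without PyTorch installed
    torch = None


def cuda_available():
    """True only if PyTorch is importable and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def to_gpu_async(batch):
    """Pin a CPU tensor, then start a non-blocking copy to the GPU."""
    if not cuda_available():
        return batch                      # no GPU: leave the tensor where it is
    pinned = batch.pin_memory()           # copy into page-locked host memory
    # On the 2017 release discussed here this was spelled .cuda(async=True)
    return pinned.cuda(non_blocking=True)
```

If the batches come from a `torch.utils.data.DataLoader`, passing `pin_memory=True` to the loader should give you pinned batches without calling `pin_memory()` yourself. When timing, remember that the copy is asynchronous too, so it still needs a `torch.cuda.synchronize()` before reading the clock.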