Speed benchmark on VGG16

I am testing PyTorch’s speed on a simple VGG16 benchmark and I see the following timings:

Gist: VGG16 benchmark

Iteration: 0 train on batch time: 414.968 ms. (warm up)
Iteration: 1 train on batch time: 274.113 ms. (much faster than warm up and subsequent iterations)
Iteration: 2 train on batch time: 377.266 ms. (from now on, pretty much constant times)
Iteration: 3 train on batch time: 386.689 ms.
Iteration: 4 train on batch time: 385.500 ms.
Iteration: 5 train on batch time: 385.082 ms.
Iteration: 6 train on batch time: 385.090 ms.

Do these timings sound reasonable, and is there a reason why iteration 1 is much faster than the rest?

System specs:

CUDA 8
CUDNN 5.1
Maxwell Titan X
PyTorch installed with conda (0.1.8-py27_1cu80)

Thanks!

You need to call torch.cuda.synchronize() before reading the end time. CUDA kernels are launched asynchronously, so the Python code returns as soon as the work is queued, not when the GPU has finished it. To measure the time correctly, you have to synchronize manually.

Also, your timings include the CUDA copies. Using pinned memory and async transfers would make them faster than the simplest approach.
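
A minimal sketch of the synchronization advice above. The helper `timed_ms` and its `sync` parameter are hypothetical names, not part of the original gist; the assumption is that `torch.cuda.synchronize` is passed as `sync` when timing GPU work.

```python
import time


def timed_ms(step, sync=None):
    """Run step() once and return the elapsed wall-clock time in milliseconds.

    sync is an optional callable that blocks until queued GPU work is done
    (for PyTorch on CUDA this would be torch.cuda.synchronize).
    """
    if sync is not None:
        sync()  # drain work queued by earlier iterations before starting the clock
    start = time.perf_counter()
    step()
    if sync is not None:
        sync()  # wait for the kernels launched by step() before stopping the clock
    return (time.perf_counter() - start) * 1000.0
```

With the training step wrapped in a function (say `train_on_batch`, also a hypothetical name), the call would look like `timed_ms(train_on_batch, sync=torch.cuda.synchronize)`. Synchronizing before the start as well keeps leftover work from the previous iteration out of the measurement.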


Thanks! Can you point me to the documentation I need to re-implement the benchmark with pinned memory and async transfers?

You can find the docs for pin_memory() here. For async copies, just call .cuda(async=True), but remember that the flag has no effect on non-pinned memory: the copy still happens, just synchronously.
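
A rough sketch of the pinned-memory transfer described above, assuming a recent PyTorch release. There the flag is spelled `non_blocking=True`, because `async` became a reserved word in Python 3.7; on the 0.1.x version from the question the equivalent call is `.cuda(async=True)`. The function name `to_gpu_async` is hypothetical.

```python
import torch


def to_gpu_async(cpu_tensor):
    """Copy a CPU tensor to the GPU through pinned (page-locked) host memory.

    Returns the input unchanged when no CUDA device is available, so the
    sketch also runs on CPU-only machines.
    """
    if not torch.cuda.is_available():
        return cpu_tensor
    # pin_memory() returns a copy of the tensor in page-locked host memory.
    pinned = cpu_tensor.pin_memory()
    # With a pinned source, the host-to-device copy can overlap with other work.
    return pinned.cuda(non_blocking=True)
```

In a benchmark loop it would probably be better to pin the input buffers once up front rather than on every iteration, since pinning itself involves a copy.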
