I’ve recently been benchmarking PyTorch, Theano and TensorFlow against each other. Here is what I have found:
for small conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=2 512 16, bs=128) all frameworks have roughly the same performance (±20%). PyTorch usually has the quickest forward pass and roughly equal backprop.
for larger conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 512 512 16, bs=128) TensorFlow is quicker on the forward pass (ca. 10-30%) and much quicker (up to 80%) on backprop.
I tested this on Python 3.6, CUDA 8.0, cuDNN 5.1 and Ubuntu 16.04, with both a Titan X and a 1080 Ti.
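For anyone who wants to reproduce this kind of comparison, here is a rough sketch of a timing helper. It is not the script used for the numbers above; the name `time_step` and its parameters are made up for illustration. The important parts are the warm-up iterations and the `sync` hook: GPU kernels are launched asynchronously, so on CUDA you would pass something like `torch.cuda.synchronize` as `sync` to make sure the work has actually finished before the clock is read.

```python
import time


def time_step(step, sync=lambda: None, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of `step`.

    `step` is any zero-argument callable (e.g. one forward or
    forward+backward pass). `sync` is called before starting and
    before stopping the clock; it defaults to a no-op, which is
    only appropriate for CPU-only work.
    """
    # Warm-up calls are not timed: the first iterations often pay
    # one-off costs (memory allocation, algorithm selection, ...).
    for _ in range(warmup):
        step()
    sync()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait for any pending asynchronous work before stopping
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without any framework.
    mean = time_step(lambda: sum(range(10000)))
    print(f"{mean * 1e3:.3f} ms per call")
```

Timing the forward pass and the forward+backward pass separately with the same helper gives numbers that are comparable across frameworks, as long as each one gets its own synchronisation call.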
In benchmark mode, for each input size, cuDNN runs a number of candidate algorithms to find the fastest one for that specific case and caches the result. This search adds some overhead, so if your input dimensions change all the time, benchmark mode will actually slow things down.
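In PyTorch the switch for this is `torch.backends.cudnn.benchmark = True`. To make the trade-off concrete, here is a toy model of the caching behaviour in plain Python. It is only a sketch of the idea, not how cuDNN is implemented: the `AutotuneCache` class and the dummy algorithms are hypothetical.

```python
import time


class AutotuneCache:
    """Toy model of per-input-shape algorithm selection with caching."""

    def __init__(self, algorithms):
        # Mapping from algorithm name to a callable taking the input shape.
        self.algorithms = algorithms
        self.best = {}      # input shape -> name of the fastest algorithm
        self.searches = 0   # how many times the expensive search ran

    def run(self, shape):
        if shape not in self.best:
            # First time this shape is seen: time every candidate once.
            self.searches += 1
            timings = {}
            for name, algo in self.algorithms.items():
                start = time.perf_counter()
                algo(shape)
                timings[name] = time.perf_counter() - start
            self.best[shape] = min(timings, key=timings.get)
        # Known shape: go straight to the cached winner.
        return self.algorithms[self.best[shape]](shape)


if __name__ == "__main__":
    # Dummy "algorithms" that just sleep for different amounts of time.
    algos = {
        "slow": lambda shape: time.sleep(0.002),
        "fast": lambda shape: time.sleep(0.001),
    }

    fixed = AutotuneCache(algos)
    for _ in range(5):
        fixed.run((128, 3, 96, 96))       # same shape every time
    print("fixed shapes, searches:", fixed.searches)

    varying = AutotuneCache(algos)
    for size in range(90, 95):
        varying.run((128, 3, size, size))  # new shape every time
    print("varying shapes, searches:", varying.searches)
```

With a fixed input shape the search runs once and every later call is cheap. With a new shape on every call the search runs every time, which is exactly the case where benchmark mode ends up slower.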
I’m not sure coming up with heuristics would be better. Maybe we should just document the benchmark option better?
This might change in the future, though: for the moment PyTorch doesn’t give you a way to choose which cuDNN algorithms to use.