PyTorch performance

I’ve recently been benchmarking PyTorch, Theano and TensorFlow against each other. Here is what I have found:

  • For small conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=2 512 16, bs=128), all frameworks have roughly the same performance (±20%). PyTorch usually has the quickest forward pass and roughly equal backprop.
  • For larger conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 512 512 16 bs=128), TensorFlow is quicker on the forward pass (ca. 10-30%) and much quicker (by as much as 80%) on backprop.

I checked this on Python 3.6, CUDA 8.0, cuDNN 5.1 and Ubuntu 16.04, with both a Titan X and a 1080 Ti.

Has anybody had a similar experience?

For larger convnets, setting the flag torch.backends.cudnn.benchmark = True helps.
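
A minimal sketch of where the flag goes, written against a recent PyTorch release. The model, layer sizes and batch size here are illustrative stand-ins, not the networks benchmarked above, and the script falls back to the CPU (where the flag has no effect) if no GPU is present.

```python
import torch
import torch.nn as nn

# Let cuDNN time its candidate convolution algorithms and keep the fastest
# one per input configuration. Set this once, before the first forward pass.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model; the layer sizes are assumptions for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
).to(device)

# The input shape stays fixed, so any autotuning cost is paid on the
# first iteration only and later iterations reuse the cached choice.
x = torch.randn(8, 3, 96, 96, device=device)
for _ in range(3):
    model.zero_grad()
    out = model(x)
    out.mean().backward()
```

The flag is global, so it only needs to be set once near the top of the training script.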


Thank you! This improved the performance significantly. Now PyTorch is on par with TensorFlow (at most 15% slower for some models).

Is there any reason that the default value of torch.backends.cudnn.benchmark is False instead of True?

It takes more memory and requires a benchmark phase, which can be costly if you change the computation graph often.


However, the computation graph is built dynamically anyway in PyTorch. Why would changing the computation graph often cause a problem?

In benchmark mode, for each input size, cuDNN runs a set of computations to find the fastest algorithm for that specific case and caches the result. This adds some overhead, and if your input dimensions change all the time, using benchmark mode will actually slow things down because of it.
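
To make the caching argument concrete, here is a toy model in plain Python. It is not cuDNN's actual implementation: `ToyAutotuner`, `algo_a` and `algo_b` are hypothetical names, and the "algorithms" are trivial placeholders. It only shows that a cache keyed by input shape pays the tuning cost once per distinct shape.

```python
import time


def algo_a(shape):
    # Placeholder "algorithm": element count via an explicit loop.
    total = 1
    for dim in shape:
        total *= dim
    return total


def algo_b(shape):
    # Second placeholder that returns the same result by a different route.
    total = 1
    for dim in reversed(shape):
        total *= dim
    return total


class ToyAutotuner:
    """Toy per-shape algorithm cache (an assumption-laden sketch, not cuDNN)."""

    def __init__(self, candidates):
        self.candidates = candidates  # name -> callable(shape)
        self.cache = {}               # shape -> name of the chosen algorithm
        self.benchmark_runs = 0       # number of extra tuning executions

    def run(self, shape):
        if shape not in self.cache:
            # Cache miss: time every candidate once and remember the fastest.
            timings = {}
            for name, algo in self.candidates.items():
                start = time.perf_counter()
                algo(shape)
                timings[name] = time.perf_counter() - start
                self.benchmark_runs += 1
            self.cache[shape] = min(timings, key=timings.get)
        return self.candidates[self.cache[shape]](shape)


candidates = {"a": algo_a, "b": algo_b}

# Fixed input shape: tuning happens once, then every call hits the cache.
fixed = ToyAutotuner(candidates)
for _ in range(10):
    fixed.run((128, 3, 96, 96))

# A new input shape on every call: every call is a cache miss.
varying = ToyAutotuner(candidates)
for size in range(10):
    varying.run((128, 3, 96 + size, 96 + size))

print(fixed.benchmark_runs, varying.benchmark_runs)
```

With two candidates and ten calls, the fixed-shape case performs two tuning runs in total, while the varying-shape case performs twenty, which is the overhead described above.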


Wouldn’t it be better to set benchmark=True by default and heuristically turn it off when there are too many cache misses?

I’m not sure that coming up with heuristics would be better. Maybe just document the benchmark option better?
But this might change in the future, since for the moment PyTorch doesn’t provide a way to choose which cuDNN algorithms to use.

Are there any benchmarks between Torch and PyTorch? I am curious if the performance is the same, or which sort of differences are inherent in the two platforms.


Nick (new Torch/PyTorch fan!)

Speed is the same; memory usage is much lower.


When I set cudnn.benchmark = True in my test program, it fails with "RuntimeError: CUDNN_STATUS_INTERNAL_ERROR". How can I resolve this problem?