Inference speed: PyTorch vs Chainer (Chainer is faster for convolution?)

I’ve migrated from Chainer to PyTorch as my deep learning library,
and found that PyTorch is a little slower than Chainer at test time with convolutional networks.

I’ve noticed this when implementing convolutional networks for segmentation:

% ./
==> Running on GPU: 0 to evaluate 1000 times
==> Testing FCN32s with Chainer
Elapsed time: 52.03 [s / 1000 evals]
Hz: 19.22 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 58.78 [s / 1000 evals]
Hz: 17.01 [hz]

I expected PyTorch to be faster than Chainer,
because it uses C extensions to speed up most functional implementations.
Is this a known result?
I’ve checked convnet-benchmarks, but couldn’t find results for PyTorch.
(I tried with torch.backends.cudnn.benchmark = True and it reaches ~22 Hz in PyTorch, but I heard it constrains the input tensor size, so it isn’t the same condition as Chainer.)
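For reference, the numbers above (elapsed seconds over 1000 evals, and Hz) can be produced by a timing loop like the one below. This is a hypothetical minimal sketch, not the author’s actual script; `forward` stands in for one forward pass of the model.

```python
import time

def benchmark(forward, n_evals=1000):
    """Call `forward()` n_evals times; return (elapsed seconds, Hz)."""
    t_start = time.time()
    for _ in range(n_evals):
        forward()
    elapsed = time.time() - t_start
    hz = n_evals / elapsed
    print('Elapsed time: %.2f [s / %d evals]' % (elapsed, n_evals))
    print('Hz: %.2f [hz]' % hz)
    return elapsed, hz
```

Note that for a fair GPU benchmark the forward pass must be synchronized (e.g. `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels launch asynchronously.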

Speed Test

PyTorch implementation

Chainer implementation


So I think there are at least two differences between the networks at the moment:

  1. You never set the PyTorch model to eval() mode, so you pay additional cost for the Dropout layers (they’re no-ops at eval time, but not at training time).
  2. We’re not using cuDNN for MaxPooling.

I only quickly glanced over the scripts so there might be more.
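The first point can be checked in isolation: in training mode Dropout zeros and rescales activations on every call, while in eval mode it is the identity. A small sketch (the `net` here is an illustrative stand-in for the FCN32s model):

```python
import torch
import torch.nn as nn

# Any module containing Dropout; FCN32s has Dropout after fc6/fc7.
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
net.eval()  # switch Dropout (and BatchNorm) to inference behavior

x = torch.ones(1, 8)
with torch.no_grad():
    y1 = net(x)
    y2 = net(x)
# In eval mode Dropout is a no-op, so repeated forward passes are
# deterministic and the dropout masking/scaling cost disappears.
```

Forgetting `net.eval()` both inflates the measured inference time and changes the outputs, so it matters for the benchmark above.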

Benchmark mode doesn’t limit the input size in any way, but it should be used only if you’ll be using a (small) number of input sizes. The benchmarks will be run for every different shape, so if your input wildly varies you might be running them at each iteration. If you train the FCN on a pre-processed dataset where all images are of the same size, use the benchmark mode. If every image is of different size, don’t use it.
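In code, benchmark mode is a single global flag; the trade-off described above is the only thing to decide. A config sketch:

```python
import torch

# Fixed (or few) input shapes: let cuDNN autotune convolution
# algorithms once per shape, then reuse the fastest one.
torch.backends.cudnn.benchmark = True

# If every image has a different size, leave it off, since the
# autotuning search would re-run for each new shape:
# torch.backends.cudnn.benchmark = False
```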


I missed that, sorry about that, but it doesn’t change the result much.

I didn’t know that, thanks for letting me know.
I tested with cuDNN disabled for max_pooling in Chainer, but it didn’t change the result much.

The current results for both dynamic and static input are below (with cudnn=False for Chainer’s max_pooling):


With input size change at each forwarding

% ./ --dynamic-input
==> Benchmark: gpu=0, times=1000, dynamic_input=True
==> Testing FCN32s with Chainer
Elapsed time: 48.83 [s / 1000 evals]
Hz: 20.48 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 57.00 [s / 1000 evals]
Hz: 17.55 [hz]


With PyTorch’s cudnn.benchmark=True:

% ./ --gpu 1
==> Benchmark: gpu=1, times=1000, dynamic_input=False
==> Testing FCN32s with Chainer
Elapsed time: 48.98 [s / 1000 evals]
Hz: 20.42 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 45.15 [s / 1000 evals]
Hz: 22.15 [hz]

Did you also set volatile=True for both frameworks? In each case it should avoid unnecessary graph construction overhead.

Yeah, I set that.

I looked into the issue, and it’s a problem with our code that chooses the cuDNN algorithms. PyTorch is faster at first, but then cuDNN asks for 17GB of memory, and we fall back to the slowest algorithm because we can’t satisfy that request. It should be fixed soon. Thanks for the report and the code that reproduces it!


Also, it seems that the dynamic option in your code only tries 2 different shapes; under such conditions benchmark mode can be used as well. It’s only a problem if there are lots of possible input sizes (say >10), because it will find a different algorithm for each size. If you only have 2 shapes, it will only benchmark twice.


This is now fixed in master. PyTorch times are now the same both in benchmark and regular modes.


Could anyone re-run the tests on updated versions of both frameworks?