I’ve migrated from Chainer to PyTorch as my deep learning library,
and found that PyTorch is a little slower than Chainer at test time with convolutional networks.
I noticed this while implementing convolutional networks for segmentation:
==> Running on GPU: 0 to evaluate 1000 times
==> Testing FCN32s with Chainer
Elapsed time: 52.03 [s / 1000 evals]
Hz: 19.22 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 58.78 [s / 1000 evals]
Hz: 17.01 [hz]
I expected PyTorch to be faster than Chainer,
because it uses C extensions to speed up computation in most of its functional implementations.
Is this a known result?
I’ve checked convnet-benchmarks, but couldn’t find any results for PyTorch.
(I tried with torch.backends.cudnn.benchmark = True and PyTorch reaches ~22 Hz, but I heard it limits the input tensor size, and it would not be the same condition as the Chainer run.)
So I think there are at least two differences between the networks at the moment:
You never set the PyTorch model to eval() mode, so you pay additional cost for the Dropout layers (they’re no-ops at eval time, but not at training time).
We’re not using cuDNN for MaxPooling.
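To illustrate the first point, here is a minimal sketch of putting a model into eval mode before timing it. The tiny nn.Sequential network is a hypothetical stand-in, not the FCN32s from the benchmark script, and torch.no_grad() assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a segmentation network (not the FCN32s from the thread).
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.5),
)
x = torch.randn(1, 3, 16, 16)

# eval() switches Dropout (and BatchNorm, if present) to inference behaviour,
# so Dropout becomes a no-op instead of sampling a random mask on every call.
model.eval()

# no_grad() additionally skips autograd bookkeeping during evaluation.
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)

# With Dropout disabled, two forward passes on the same input agree exactly.
eval_is_deterministic = torch.equal(out_a, out_b)
```

Calling model.train() afterwards restores the training behaviour, so the switch only needs to wrap the evaluation loop.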
I only quickly glanced over the scripts so there might be more.
Benchmark mode doesn’t limit the input size in any way, but it should be used only if you’ll be feeding a small number of distinct input sizes. The benchmarks are run for every new shape, so if your input size varies widely you might end up running them at every iteration. If you train the FCN on a pre-processed dataset where all images are the same size, use benchmark mode. If every image has a different size, don’t use it.
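The rule of thumb above could be sketched roughly like this. should_enable_benchmark and its max_distinct_shapes threshold are hypothetical names chosen for illustration; only the torch.backends.cudnn.benchmark flag itself is part of PyTorch.

```python
import torch

def should_enable_benchmark(input_shapes, max_distinct_shapes=10):
    """Hypothetical helper: allow benchmark mode only for a few distinct shapes."""
    distinct = {tuple(shape) for shape in input_shapes}
    return len(distinct) <= max_distinct_shapes

# Pre-processed dataset: every image has the same size.
fixed_shapes = [(1, 3, 480, 640)] * 1000
# Raw dataset: (almost) every image has a different size.
varied_shapes = [(1, 3, 300 + i, 400 + i) for i in range(1000)]

use_for_fixed = should_enable_benchmark(fixed_shapes)
use_for_varied = should_enable_benchmark(varied_shapes)

# Set the flag once, before the first forward pass.
torch.backends.cudnn.benchmark = use_for_fixed
```

The threshold is an assumption; the right cut-off depends on how long the evaluation runs relative to the cost of benchmarking each new shape.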
I looked into the issue and it’s a problem with our code that chooses the cuDNN algorithms. PyTorch is faster at first, but then cuDNN asks for 17GB of mem, and we just fall back to the slowest algo because we can’t satisfy that. It should be fixed soon. Thanks for the report and code that reproduces it!
Also, it seems that the dynamic option in your code only tries 2 different shapes, and in that case benchmark mode can be used as well. It’s only a problem if there are lots of possible input sizes (say >10), because it will search for algorithms separately for each size. If you only have 2 shapes then it will only benchmark twice.
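A toy model of that behaviour, in plain Python: it only counts how many benchmark runs a sequence of input shapes would trigger if results are cached per shape. This is an illustration of the reasoning, not cuDNN's or PyTorch's actual cache, and count_benchmark_runs is a made-up name.

```python
def count_benchmark_runs(shape_sequence):
    """Toy model: one benchmark run per previously unseen input shape."""
    tuned = {}  # shape -> chosen algorithm (placeholder string here)
    runs = 0
    for shape in shape_sequence:
        if shape not in tuned:
            tuned[shape] = "algo-for-{}".format(shape)
            runs += 1
    return runs

# Two alternating shapes over 1000 evaluations.
two_shapes = [(1, 3, 480, 640), (1, 3, 240, 320)] * 500
# A different shape at every evaluation.
many_shapes = [(1, 3, 200 + i, 300 + i) for i in range(1000)]

runs_two = count_benchmark_runs(two_shapes)
runs_many = count_benchmark_runs(many_shapes)
print(runs_two, runs_many)  # prints: 2 1000
```

With two shapes the benchmarking cost is paid twice and amortised over the remaining evaluations; with a new shape every time it is paid on every evaluation.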