Inference speed: PyTorch vs Chainer (Chainer is faster for convolution?)

I’ve migrated from Chainer to PyTorch as my deep learning library,
and I’ve found that PyTorch is a little slower than Chainer at test time with convolutional networks.

I’ve noticed this when implementing convolutional networks for segmentation:

% ./speedtest.py
==> Running on GPU: 0 to evaluate 1000 times
==> Testing FCN32s with Chainer
Elapsed time: 52.03 [s / 1000 evals]
Hz: 19.22 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 58.78 [s / 1000 evals]
Hz: 17.01 [hz]

I expected PyTorch to be faster than Chainer,
because it uses C extensions to speed up most of its functional implementations.
Is this a known result?
I’ve checked convnet-benchmarks, but couldn’t find results for PyTorch.
(I tried torch.backends.cudnn.benchmark = True and it gives ~22 Hz in PyTorch, but I heard it limits the input tensor size, so it wouldn’t be the same condition as Chainer.)

Speed Test

  - PyTorch implementation
  - Chainer implementation
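
For context, a minimal sketch of the kind of timing loop the PyTorch side of such a speed test uses (this is an assumption based on the thread rather than a copy of the linked script; torchfcn.models.FCN32s is the model class from the linked repository, the input size is an arbitrary example, and torch.cuda.synchronize() is what makes the asynchronous GPU work count towards the measured time):

import time

import torch
import torchfcn

model = torchfcn.models.FCN32s().cuda()
model.eval()  # inference mode: Dropout becomes a no-op

x = torch.randn(1, 3, 480, 640).cuda()

torch.cuda.synchronize()
t_start = time.time()
for _ in range(1000):
    with torch.no_grad():  # volatile=True on the PyTorch version used in this thread
        model(x)
torch.cuda.synchronize()  # wait for queued GPU kernels before stopping the clock
elapsed = time.time() - t_start

print('Elapsed time: %.2f [s / 1000 evals]' % elapsed)
print('Hz: %.2f [hz]' % (1000.0 / elapsed))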

So I think there are at least two differences between the networks at the moment:

  1. You never set the PyTorch model to eval() mode, so you pay the additional cost of the Dropout layers (they’re no-ops at eval time, but not at training time); see the sketch after this list.
  2. We’re not using cuDNN for MaxPooling.
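
A quick way to see the first point in isolation (a toy Dropout layer, not the FCN model itself):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()   # training mode: masks roughly half the entries and rescales the rest
print(drop(x))

drop.eval()    # eval mode: identity, no extra work at inference time
print(drop(x))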

I only quickly glanced over the scripts so there might be more.

Benchmark mode doesn’t limit the input size in any way, but it should only be used if you’ll be feeding a small number of distinct input sizes. The benchmarking is run for every different shape, so if your input sizes vary wildly you might end up re-benchmarking at every iteration. If you train the FCN on a pre-processed dataset where all images are of the same size, use benchmark mode. If every image has a different size, don’t use it.
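
A small sketch of the intended usage (the layer and shapes below are made-up examples):

import torch

torch.backends.cudnn.benchmark = True  # autotune conv algorithms per input shape

conv = torch.nn.Conv2d(3, 64, 3, padding=1).cuda()

# The first call with each new input shape pays the benchmarking cost;
# later calls with the same shape reuse the cached algorithm choice.
for shape in [(1, 3, 480, 640), (1, 3, 480, 640), (1, 3, 360, 480)]:
    x = torch.randn(*shape).cuda()
    with torch.no_grad():
        conv(x)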

I missed that, sorry about that, but it doesn’t change the result much.

I didn’t know that, thanks for letting me know.
I tested with cuDNN disabled for max_pooling in Chainer, but that doesn’t change the result much either.
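
For reference, this is roughly what disabling cuDNN for max pooling looks like in Chainer (the exact spelling depends on the Chainer version: use_cudnn was a per-call argument in Chainer v1 and a config key from v2 on; the input below is a made-up array):

import numpy as np
import chainer
import chainer.functions as F

x = np.random.randn(1, 3, 64, 64).astype(np.float32)

# Chainer v1 style (per-call flag):
# y = F.max_pooling_2d(x, ksize=2, stride=2, use_cudnn=False)

# Chainer v2+ style (configuration context):
with chainer.using_config('use_cudnn', 'never'):
    y = F.max_pooling_2d(x, ksize=2, stride=2)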

The current results are below for both dynamic and static input (with cudnn=False for Chainer’s max_pooling):

Dynamic

With the input size changing at each forward pass:

% ./speedtest.py --dynamic-input
==> Benchmark: gpu=0, times=1000, dynamic_input=True
==> Testing FCN32s with Chainer
Elapsed time: 48.83 [s / 1000 evals]
Hz: 20.48 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 57.00 [s / 1000 evals]
Hz: 17.55 [hz]

Static

With a fixed input size and PyTorch cudnn.benchmark=True:

% ./speedtest.py --gpu 1
==> Benchmark: gpu=1, times=1000, dynamic_input=False
==> Testing FCN32s with Chainer
Elapsed time: 48.98 [s / 1000 evals]
Hz: 20.42 [hz]
==> Testing FCN32s with PyTorch
Elapsed time: 45.15 [s / 1000 evals]
Hz: 22.15 [hz]

Did you also set volatile=True for both frameworks? In each case it should avoid unnecessary graph construction overhead.

Yeah, I set that.
Chainer: https://github.com/wkentaro/pytorch-fcn/blob/master/examples/voc/speedtest.py#L25
PyTorch: https://github.com/wkentaro/pytorch-fcn/blob/master/examples/voc/speedtest.py#L71
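
For readers landing here on newer versions: volatile was the inference flag in the PyTorch and Chainer releases used in this thread; a rough present-day equivalent of “no graph construction in either framework” looks like the sketch below (the single conv layer is an arbitrary stand-in for FCN32s):

import numpy as np
import torch
import chainer
import chainer.links as L

x = np.random.randn(1, 3, 64, 64).astype(np.float32)

# PyTorch: Variable(..., volatile=True) was replaced by the no_grad context.
pt_conv = torch.nn.Conv2d(3, 8, 3, padding=1)
with torch.no_grad():
    y_pt = pt_conv(torch.from_numpy(x))

# Chainer: the volatile flag was replaced by no_backprop_mode (plus the train=False config).
ch_conv = L.Convolution2D(3, 8, 3, pad=1)
with chainer.no_backprop_mode(), chainer.using_config('train', False):
    y_ch = ch_conv(x)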

I looked into the issue and it’s a problem with our code that chooses the cuDNN algorithms. PyTorch is faster at first, but then cuDNN asks for 17GB of memory for its workspace, and we fall back to the slowest algorithm because we can’t satisfy that request. It should be fixed soon. Thanks for the report and for the code that reproduces it!

Also, it seems that the dynamic option in your code only tries two different shapes; in that case benchmark mode can be used as well. It’s only a problem if there are lots of possible input sizes (say more than 10), because it will search for an algorithm separately for each size. If you only have two shapes, it will only benchmark twice.

This is now fixed in master. PyTorch times are now the same in both benchmark and regular modes.

Could anyone re-run the tests on updated versions of both?