TensorFlow vs. PyTorch ConvNet benchmark

I created a benchmark to compare the performance of TensorFlow and PyTorch for fully convolutional neural networks in this GitHub repository:

I need to make sure that these two implementations are identical. If they are, then the results show that TensorFlow is about 5% faster in one of the experiments and about 20% faster in another.

Thanks in advance

Your benchmarking code isn’t accurate. CUDA kernels are launched asynchronously, but you’re measuring the time using time.time() even though the forward pass won’t have completed at that point.

To my knowledge, both PyTorch and TensorFlow use the same implementation (cuDNN) for convolutions. The speed of the convolution implementation is going to dominate if you have a sufficient batch size (which you do). The other differences you see are probably differences in measurement (see the point above) and in how you’re passing in data.

There are already a number of accurate CNN benchmarks:


But the Python API is synchronous; otherwise it wouldn’t work at all. Would you please explain the asynchronous API more clearly?

After all, I measure the time of 10 iterations.

CUDA kernels are asynchronous by default. That means that as soon as a CUDA kernel is launched, the program gets control back even though the computation is not finished.

To make sure the computation has finished, you should call “cudaDeviceSynchronize”; your best bet would be through CuPy.
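To make the timing point concrete, here is a minimal sketch of a timing helper that waits for the device before reading the clock. The `benchmark` function and its parameters are my own illustration, not something from the benchmark repository. On the PyTorch side, `torch.cuda.synchronize()` does the waiting, so it can be passed in as the `synchronize` argument; the small convolution used in the demo is just a stand-in model.

```python
import time


def benchmark(step, synchronize=lambda: None, warmup=3, iters=10):
    """Return mean seconds per call of step().

    synchronize is called before starting and before stopping the clock,
    so asynchronously queued GPU work is included in the measurement.
    """
    for _ in range(warmup):
        step()  # warm-up iterations are not timed
    synchronize()  # drain warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters


# Demo: use PyTorch on a GPU if it is available, otherwise time a CPU-only task.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()  # stand-in model
    x = torch.randn(32, 3, 224, 224, device="cuda")
    with torch.no_grad():
        seconds = benchmark(lambda: model(x), torch.cuda.synchronize)
else:
    seconds = benchmark(lambda: sum(range(10_000)))

print(f"{seconds * 1e3:.3f} ms per iteration")
```

Without the second `synchronize()` call, the clock can stop while kernels are still queued, which is why timing 10 iterations with a bare `time.time()` can still under-report.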

Furthermore, there might be a difference due to the tensor layouts:

PyTorch uses NCHW and TensorFlow uses NHWC by default. NCHW was the first layout supported by cuDNN, but it presents a big challenge for optimization (due to access patterns in convolutions, memory coalescing and such).
NHWC is easier to optimize for convolutions but suffers in linear layers, IIRC, because you have to physically transpose/permute the dimensions.
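To illustrate what the two layouts mean in memory, here is a small sketch in plain Python that computes the flat offset of an element under each layout. The function names and the tiny 3x4x4 image are made up for the example; the index formulas are just the standard row-major ones.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when the width axis varies fastest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same element when the channel axis varies fastest."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 4, 4  # hypothetical tiny image: 3 channels, 4x4 pixels

# Distance in memory between channel 0 and channel 1 of the same pixel.
channel_stride_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_stride_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

print(channel_stride_nchw, channel_stride_nhwc)  # 16 1
```

In NHWC all channels of one pixel sit next to each other, while in NCHW they are a full H*W plane apart, so a kernel that reads every channel of a pixel touches memory very differently under the two layouts.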

Furthermore, due to its dynamic nature, PyTorch allocates new memory for each new batch, while TensorFlow can just reuse previous memory locations since the sizes are known in advance.

Memory is THE bottleneck in deep learning, not the CPU. The big challenge is how to feed data fast enough to the CPU and GPU to get the maximum GFLOPS throughput.

So I think the benchmark is worth it.


I know that TensorFlow has the tf_cnn_benchmarks script. Does PyTorch have anything similar?