TensorFlow vs. PyTorch ConvNet benchmark

CUDA kernels are asynchronous by default: as soon as a kernel is launched, the program gets control back even though the computation has not finished.

To make sure the computation has finished, you should call "cudaDeviceSynchronize"; your best bet would be through CuPy.
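As a minimal sketch of why the synchronization matters for timing, the helper below drains pending device work before starting and before stopping the clock. The name `benchmark` and the `sync` callback are assumptions made for illustration and are not part of any library; by default `sync` does nothing, so the snippet runs on a machine without a GPU.

```python
import time


def benchmark(fn, sync=lambda: None, warmup=3, iters=10):
    """Return mean wall-clock seconds per call of fn.

    sync is a hypothetical hook: it should block until all queued
    device work is done (the default is a no-op for CPU-only code).
    """
    # Warm-up runs absorb one-time costs (lazy init, autotuning, caches).
    for _ in range(warmup):
        fn()
    sync()  # drain warm-up work so it is not counted in the measurement

    start = time.perf_counter()
    for _ in range(iters):
        fn()  # on a GPU this may only enqueue kernels and return at once
    sync()  # wait for the queued work before reading the clock
    return (time.perf_counter() - start) / iters
```

With PyTorch, one would presumably pass `sync=torch.cuda.synchronize`, e.g. `benchmark(lambda: model(x), sync=torch.cuda.synchronize)`, where `model` and `x` stand in for your own network and input batch. Without the final `sync()` call, the timer would mostly measure kernel launch overhead.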

Furthermore, there might be a difference due to the tensor layouts:

PyTorch uses NCHW and TensorFlow uses NHWC by default. NCHW was the first layout supported by cuDNN, but it presents a big challenge for optimization (due to access patterns in convolutions, memory coalescing and such).
NHWC is easier to optimize for convolutions but suffers in linear layers, IIRC, because you have to physically transpose/permute the dimensions.
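To make the layout difference concrete, here is a small sketch that computes where one element lands in a flat, contiguous buffer under each layout. The function names and the example sizes (C=3, H=4, W=5) are made up for illustration; it only shows the index arithmetic, not how any framework or cuDNN actually implements convolutions.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same logical element in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


# Illustrative sizes only: 3 channels, 4 rows, 5 columns.
C, H, W = 3, 4, 5

# Distance in memory between channel 0 and channel 1 of the same pixel.
channel_step_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_step_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

# Distance in memory between two horizontally adjacent pixels of one channel.
width_step_nchw = offset_nchw(0, 0, 0, 1, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
width_step_nhwc = offset_nhwc(0, 0, 0, 1, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

print("channel step:", channel_step_nchw, "(NCHW) vs", channel_step_nhwc, "(NHWC)")
print("width step:  ", width_step_nchw, "(NCHW) vs", width_step_nhwc, "(NHWC)")
```

In NCHW the channels of one pixel sit H*W elements apart (20 here), while in NHWC they are adjacent. Since a convolution sums over all input channels at each pixel, that is one way to see why the two layouts produce such different memory access patterns.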

Furthermore, due to its dynamic nature, PyTorch allocates new memory for each new batch, while TensorFlow can just reuse previous memory locations since the sizes are known in advance.

Memory is THE bottleneck in deep learning, not the CPU: the big challenge is how to feed data fast enough to the CPU and GPU to get the maximum GFLOPS throughput.

So I think the benchmark is worth it.