PyTorch vs using cuDNN (in C++)

I’m trying to use cuDNN directly in C++ (for various reasons). My code works just fine, as in it compiles, the loss goes down, and accuracy on MNIST improves as expected, but it is awfully slow. PyTorch in Python is somehow WAY faster than using cuDNN directly in C++, even when I load the whole MNIST dataset onto the GPU to avoid host<->device transfers and set the convolution math type to CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
My question is: what extra stuff is PyTorch doing that makes it so much faster? Is it a whole bunch of small things that add up to a gigantic speedup, or do 80% of the gains come from a few 20% things? Is there a place where I can look for some of these things?
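For reference, the math type is set on the convolution descriptor before algorithm selection, roughly like this (a minimal sketch with made-up convolution parameters, not my full training code):

```cpp
// Minimal sketch: enable CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on a 2D conv descriptor.
// Parameters (padding, strides, etc.) are placeholders for illustration only.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc,
                                    /*pad_h=*/1, /*pad_w=*/1,
                                    /*stride_h=*/1, /*stride_w=*/1,
                                    /*dilation_h=*/1, /*dilation_w=*/1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Allow FP32 convolutions to be down-converted so Tensor Cores can be used.
    cudnnStatus_t st =
        cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
    std::printf("set math type: %s\n", cudnnGetErrorString(st));

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);
    return 0;
}
```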

You can take a look at the cuDNN bindings, in particular at Conv_v7.cpp for the legacy API and Conv_v8.cpp for the new frontend API usage.


Thanks for the link! They both look very different, and Conv_v7.cpp seems to have a lot more going on in it.
Anyway, the main thing I got out of Conv_v7.cpp is that PyTorch is doing at least these two things:

  1. Some kind of 32-bit split across the batch dimension. I didn’t fully get what it is, but since I’m working with gigapixel images (and hence my own cuDNN code), my batch size is one, so I guess I can ignore it.
  2. Instead of setting cudnnSetConvolutionMathType to one value and then calling cudnnFindConvolutionForwardAlgorithm, PyTorch tries every possible MathType with cudnnFindConvolutionForwardAlgorithmEx (and not cudnnFindConvolutionForwardAlgorithm) to find the fastest combination (see the sketch after this list).
    Is that correct? And did I miss something that’s very important for performance?
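
If I read Conv_v7.cpp right, the idea in point 2 looks roughly like the sketch below: benchmark every math type with cudnnFindConvolutionForwardAlgorithmEx and keep the fastest (algorithm, math type) pair. The tensor shapes and workspace size are made up just to illustrate the loop; this is not PyTorch’s actual code.

```cpp
// Sketch: try each math type, benchmark all forward algorithms with the Ex API,
// and remember the fastest (algo, mathType) combination.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>
#include <vector>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Made-up problem: NCHW float, 1x3x224x224 input, 64x3x3x3 filter, same padding.
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 3, 224, 224);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 3, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 64, 224, 224);

    float *x, *w, *y;
    void *workspace;
    const size_t workspaceSize = 256 * 1024 * 1024;  // arbitrary cap for the sketch
    cudaMalloc(&x, sizeof(float) * 1 * 3 * 224 * 224);
    cudaMalloc(&w, sizeof(float) * 64 * 3 * 3 * 3);
    cudaMalloc(&y, sizeof(float) * 1 * 64 * 224 * 224);
    cudaMalloc(&workspace, workspaceSize);

    const cudnnMathType_t mathTypes[] = {CUDNN_DEFAULT_MATH, CUDNN_TENSOR_OP_MATH,
                                         CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION};
    float bestTime = FLT_MAX;
    cudnnConvolutionFwdAlgo_t bestAlgo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    cudnnMathType_t bestMath = CUDNN_DEFAULT_MATH;

    for (cudnnMathType_t mt : mathTypes) {
        cudnnSetConvolutionMathType(convDesc, mt);
        int returned = 0;
        std::vector<cudnnConvolutionFwdAlgoPerf_t> perf(CUDNN_CONVOLUTION_FWD_ALGO_COUNT);
        cudnnFindConvolutionForwardAlgorithmEx(
            handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
            (int)perf.size(), &returned, perf.data(), workspace, workspaceSize);
        for (int i = 0; i < returned; ++i) {
            if (perf[i].status == CUDNN_STATUS_SUCCESS && perf[i].time < bestTime) {
                bestTime = perf[i].time;
                bestAlgo = perf[i].algo;
                bestMath = mt;
            }
        }
    }
    std::printf("fastest: algo %d, mathType %d, %.3f ms\n",
                (int)bestAlgo, (int)bestMath, bestTime);

    // (cleanup omitted for brevity)
    return 0;
}
```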
  1. This is used to work around the 32-bit limitation of cuDNN for large inputs: we split the workload along the batch dimension if possible (see the sketch below).

  2. cudnnFind is used when torch.backends.cudnn.benchmark = True is set.
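
To make point 1 concrete, the splitting boils down to something like the sketch below (made-up shapes; the real code also re-creates the tensor descriptors and offsets the device pointers for each chunk):

```cpp
// Sketch of the batch-split workaround for cuDNN's 32-bit element limit.
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t kMaxElems = INT32_MAX;          // cuDNN's 32-bit element limit
    int64_t n = 8, c = 64, h = 4096, w = 4096;    // hypothetical oversized input
    int64_t perSample = c * h * w;                // elements per batch sample

    // Largest batch chunk whose element count still fits under the limit.
    int64_t chunk = std::max<int64_t>(1, kMaxElems / perSample);
    for (int64_t i = 0; i < n; i += chunk) {
        int64_t thisChunk = std::min(chunk, n - i);
        // Real code would set N = thisChunk on the input/output descriptors,
        // offset the device pointers by i * perSample elements, and then run
        // cudnnConvolutionForward on just this slice.
        std::printf("convolving samples [%lld, %lld)\n",
                    (long long)i, (long long)(i + thisChunk));
    }
    return 0;
}
```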
