I tried running some LSTMs manually by calling directly out to the cuDNN library. Everything seems to work great, except the performance isn't what I would expect when comparing against libtorch: my direct-to-cuDNN code is roughly 2× slower than the identical computation performed through libtorch.
I was under the impression that if libtorch claims to have access to cuDNN features (which in my case it does), it will use cuDNN’s RNN implementation, so I expected performance to be identical. Is it possible I’m under the wrong impression and that, in fact, libtorch is using custom kernels which outperform cuDNN, at least on my hardware? Is there something else I am overlooking?
For context: to do the measurements I wrote two C++ programs, one linking against libtorch and the other against libcudnn. After setting up the LSTM weights/descriptors/etc., I create a large input tensor and run it through the LSTM 10 times, keeping track of the start and end time for the whole loop; the program prints the difference between these times and then exits. I am running on a GTX 1080.
Any insight will be super helpful! Thanks in advance.