When does pytorch use cuDNN for RNN inference?

Hi everyone,

I tried running some LSTMs manually using cuDNN by calling directly out to the library. Everything seems to work fine, except the performance isn’t what I would expect when compared against libtorch: my direct-to-cuDNN code is slower by a factor of two than the identical calculation performed through libtorch.

I was under the impression that if libtorch claims to have access to cuDNN features (which in my case it does), it will use cuDNN’s RNN implementation, so I expected performance to be identical. Is it possible I’m under the wrong impression and that, in fact, libtorch is using custom kernels which outperform cuDNN, at least on my hardware? Is there something else I am overlooking?
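(For the record, I check libtorch’s view of cuDNN with something along these lines; as far as I know both helpers live in torch/cuda.h, but the exact location may vary by libtorch version:)

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // Both calls are part of the libtorch C++ API.
  std::cout << "CUDA available:  " << torch::cuda::is_available() << "\n";
  std::cout << "cuDNN available: " << torch::cuda::cudnn_is_available() << "\n";
}
```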

For context: to do the measurements I wrote two C++ programs, one linked against libtorch and the other against libcudnn. After setting up the LSTM weights, descriptors, etc., I create a large input tensor and run it through the LSTM 10 times, keeping track of the start and end times for the whole loop; the program prints the difference between these times and then exits. I am running on a GTX 1080.
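In case it helps, the libtorch side looks roughly like this (a trimmed-down sketch; the sizes are placeholders rather than my real configuration, and the exact return type of forward() differs between libtorch versions):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>
#include <tuple>

int main() {
  torch::NoGradGuard no_grad;  // inference only

  // Placeholder sizes, not my real configuration.
  const int64_t input_size = 512, hidden_size = 512, num_layers = 2;
  const int64_t seq_len = 100, batch = 64;

  torch::nn::LSTM lstm(
      torch::nn::LSTMOptions(input_size, hidden_size).num_layers(num_layers));
  lstm->to(torch::kCUDA);
  lstm->eval();

  // Large input tensor, shape (seq_len, batch, input_size).
  auto input = torch::randn({seq_len, batch, input_size}, torch::kCUDA);

  torch::Tensor out;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 10; ++i) {
    out = std::get<0>(lstm->forward(input));
  }
  // Pull the result back to the host; the device-to-host copy blocks
  // until all queued kernels have finished.
  auto host = out.cpu();
  auto end = std::chrono::steady_clock::now();

  std::cout << std::chrono::duration<double>(end - start).count() << " s\n";
}
```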

Any insight will be super helpful! Thanks in advance.

You can check whether cuDNN is used with e.g. nvprof, which lists the kernels that are being run. This might also give you insight into what else is going on.
Many people asking about performance oddities forget to synchronize the GPU before the timing starts and again before it stops; this may or may not be the case in your testing.
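For the raw cuDNN program, the usual pattern is to bracket the timed region with cudaDeviceSynchronize, roughly like this (a sketch; run_lstm_inference is just a stand-in for your actual cuDNN calls):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

// Stand-in for the code under test; replace with your cuDNN calls,
// e.g. cudnnRNNForwardInference(...). Kernel launches are asynchronous.
void run_lstm_inference() { /* ... */ }

int main() {
  cudaDeviceSynchronize();  // setup/warm-up work must not leak into the timing
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 10; ++i) {
    run_lstm_inference();
  }
  cudaDeviceSynchronize();  // wait for queued kernels before stopping the clock
  auto t1 = std::chrono::steady_clock::now();
  std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}
```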

Best regards

Thomas

Thanks for your input, @tom. Actually, after I posted I did a bit of inspecting (using oprofile, because I had the logs sitting around so I didn’t have to run anything new), and indeed it looks like cuDNN is being used by libtorch, which leaves me befuddled. It isn’t a synchronization issue because, just in case, I do a GPU -> CPU transfer at the end, which is blocking.
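Concretely, the end of my cuDNN timing loop does something like this (a sketch with made-up names; on the default stream the device-to-host copy blocks until all previously queued kernels have finished, so it doubles as a sync point):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Copy the LSTM output back to the host. cudaMemcpy on the default
// stream waits for all preceding kernels before it returns, so reading
// the clock afterwards includes the kernel execution time.
std::vector<float> fetch_output(const float* dev_output, size_t count) {
  std::vector<float> host_output(count);
  cudaMemcpy(host_output.data(), dev_output,
             count * sizeof(float), cudaMemcpyDeviceToHost);
  return host_output;
}
```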

I wonder if it might have to do with the way I build my cuDNN application vs. the way libtorch is built? To test that theory, I’ve tried both dynamic and static linking in my cuDNN test app without seeing any difference in performance.