I’m afraid I cannot answer the question fully, since I don’t know the details of the cuDNN implementation. Apparently an RNN, which reuses the same weight matrix at every time step, exploits pre-transposing for a performance improvement:
https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/
When performing a GEMM, the standard BLAS API allows you to transpose either of the two input matrices. Some of the four transposed/not-transposed combinations run slightly faster or slower than the others, so depending on how the equations are mapped onto the computation, a slower variant of the GEMM may end up being used. By performing a transpose operation up front on the weight matrix, each step can use a faster variant. This comes at the cost of the transpose itself, but that is fairly cheap, so if the transposed matrix will be reused for more than a few iterations it is often worth it.
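For illustration, here is a minimal cuBLAS sketch of the idea: pay for the transpose once with `cublasSgeam`, then issue the per-step GEMM in plain non-transposed ("NN") form instead of the transposed ("TN") form. The cuBLAS calls are real, but the dimensions, names, and loop structure are my own assumptions for a generic recurrence, not cuDNN's actual code.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Repeatedly computes Y = W^T * X. Instead of issuing a "TN" GEMM each
// step, we transpose W once and run the (often faster) "NN" GEMM in the loop.
// All matrices are column-major on the device; a real RNN would also feed Y
// back into the next step, which is omitted here for brevity.
void recurrent_gemms(cublasHandle_t handle,
                     const float* W,   // m x n weight matrix
                     const float* X,   // m x b input batch
                     float*       Y,   // n x b output batch
                     float*       Wt,  // n x m scratch for the transpose
                     int m, int n, int b, int steps) {
    const float one = 1.0f, zero = 0.0f;

    // One-time cost: Wt = W^T via an out-of-place cublasSgeam.
    // With beta == 0 the second input operand is never read.
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, m,
                &one, W, m, &zero, Wt, n, Wt, n);

    for (int t = 0; t < steps; ++t) {
        // Y = Wt * X, i.e. W^T * X, issued as an "NN" GEMM.
        // Without the pre-transpose this would be the "TN" variant:
        //   cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, b, m,
        //               &one, W, m, X, m, &zero, Y, n);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, b, m,
                    &one, Wt, n, X, m, &zero, Y, n);
    }
}
```

Whether the "NN" variant actually beats "TN" depends on the GPU, the library version, and the matrix shapes, so in practice this is worth benchmarking for your particular sizes.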
It would be nice if someone else could elaborate on this for more general models, including feed-forward networks.