Training with cuDNN uses more GPU memory and takes longer

Hi! I’ve been comparing the resource usage of my PyTorch model with and without cuDNN. With torch.backends.cudnn.enabled=False, training the network is 2 to 3 times faster and uses less memory, so I can fit a bigger batch size. I tested the same network on an NVIDIA RTX 3090, an RTX 2080 Ti, and a GTX 1080 Ti, and the results are consistent across all three.
I thought cuDNN was supposed to improve training performance. Is there anything I can do to investigate this further?
I use conda to manage my environment. Here are the specs:
Ubuntu 18.04
CUDA 11.1
NVIDIA driver 455.32
PyTorch 1.8.1
cuDNN 8.0.5_0

Yes, cuDNN should accelerate the workload. Could you post the model definition so that we can take a look at it?
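
Until the model definition is available, one way to narrow this down is to time a forward and backward pass and record peak memory with cuDNN on and off. The snippet below is a minimal sketch, not a definitive benchmark: the helper name `measure` and the small stand-in network are assumptions made up for illustration (the stand-in includes a depthwise convolution only because EfficientNet blocks use them), and it falls back to the CPU when no GPU is present, in which case the cuDNN flag has no effect and the memory figure is reported as 0.

```python
import time

import torch
import torch.nn as nn


def measure(model, batch, cudnn_enabled, steps=5, warmup=2):
    """Return (seconds per step, peak bytes allocated) for forward + backward."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = cudnn_enabled
    on_gpu = batch.is_cuda
    try:
        # Warm-up iterations keep one-time setup costs out of the timing.
        for _ in range(warmup):
            model.zero_grad(set_to_none=True)
            model(batch).sum().backward()
        if on_gpu:
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(steps):
            model.zero_grad(set_to_none=True)
            model(batch).sum().backward()
        if on_gpu:
            # CUDA kernels run asynchronously, so wait before stopping the clock.
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / steps
        peak = torch.cuda.max_memory_allocated() if on_gpu else 0
    finally:
        # Restore the global flag so later code is unaffected.
        torch.backends.cudnn.enabled = previous
    return elapsed, peak


device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical stand-in network; replace it with the real model and batch.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.Conv2d(16, 16, 3, padding=1, groups=16),  # depthwise convolution
    nn.ReLU(),
    nn.Conv2d(16, 2, 1),
).to(device)
batch = torch.randn(2, 3, 32, 32, device=device)

for enabled in (True, False):
    sec, peak = measure(model, batch, enabled)
    print(f"cudnn.enabled={enabled}: {sec * 1e3:.2f} ms/step, "
          f"peak {peak / 2**20:.1f} MiB")
```

Running the same helper on individual submodules (for example the encoder alone, then the decoder alone) may show which part of the network accounts for the slowdown.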

The exact architecture is a bit complicated. In a nutshell, it’s an encoder-decoder architecture with EfficientNet as the encoder. Is this information helpful?

Unfortunately not, since we do see speedups from cuDNN for EfficientNet.

We use EfficientNet as the backbone and an FPN as the decoder for semantic segmentation. We take the outputs of different EfficientNet stages and feed them into the corresponding FPN levels. Does this help?
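
For anyone trying to reproduce this without the real code, the structure described here can be approximated by a small stand-in. This is a hypothetical sketch and not the actual model: the class name `TinyFPNSeg`, the channel widths, and the number of stages are made up, and plain strided convolutions take the place of the EfficientNet stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFPNSeg(nn.Module):
    """Hypothetical stand-in: multi-stage encoder feeding an FPN-style decoder."""

    def __init__(self, num_classes=2, widths=(16, 32, 64), fpn_channels=32):
        super().__init__()
        stages, in_ch = [], 3
        for w in widths:
            # Each stage halves the resolution, like successive backbone stages.
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            in_ch = w
        self.stages = nn.ModuleList(stages)
        # 1x1 lateral convolutions bring every stage to a common channel count.
        self.laterals = nn.ModuleList(
            [nn.Conv2d(w, fpn_channels, 1) for w in widths]
        )
        self.head = nn.Conv2d(fpn_channels, num_classes, 3, padding=1)

    def forward(self, x):
        size = x.shape[-2:]
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        laterals = list(self.laterals)
        # Top-down pathway: start at the coarsest map and merge finer ones.
        top = laterals[-1](feats[-1])
        for lateral, feat in zip(laterals[-2::-1], feats[-2::-1]):
            top = F.interpolate(top, size=feat.shape[-2:], mode="nearest")
            top = top + lateral(feat)
        logits = self.head(top)
        # Upsample the per-pixel class scores back to the input resolution.
        return F.interpolate(logits, size=size, mode="bilinear",
                             align_corners=False)


model = TinyFPNSeg()
out = model(torch.randn(2, 3, 32, 32))
print(tuple(out.shape))
```

If the cuDNN slowdown also shows up on a stand-in like this, it is much easier to share and debug than the full network.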

I guess cuDNN takes more GPU memory because it needs workspace buffers. If we set torch.backends.cudnn.benchmark=True, it allocates the maximum workspace for each algorithm it tries, which costs more memory. I have the same problem right now and often get OOM errors when using cuDNN.
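
A rough way to check this guess is to compare peak memory for one training step with benchmark mode off and on. The sketch below rests on assumptions: `peak_memory_mib` and the tiny model are hypothetical names made up for illustration, and it presumes the cuDNN workspace is allocated through PyTorch's caching allocator so that it shows up in the `torch.cuda` memory counters. On a machine without a GPU it simply returns zeros.

```python
import torch
import torch.nn as nn


def peak_memory_mib(model, batch, benchmark):
    """Peak (allocated, reserved) memory in MiB for one forward + backward."""
    previous = torch.backends.cudnn.benchmark
    torch.backends.cudnn.benchmark = benchmark
    try:
        if batch.is_cuda:
            # Release cached blocks so the reserved figure starts clean.
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()
        model.zero_grad(set_to_none=True)
        model(batch).sum().backward()
        if not batch.is_cuda:
            return 0.0, 0.0  # the CUDA counters do not apply on the CPU
        torch.cuda.synchronize()
        mib = 1024 ** 2
        return (torch.cuda.max_memory_allocated() / mib,
                torch.cuda.max_memory_reserved() / mib)
    finally:
        # Restore the global flag so later code is unaffected.
        torch.backends.cudnn.benchmark = previous


device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical stand-in; replace it with the real model and batch.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1),
).to(device)
batch = torch.randn(4, 3, 64, 64, device=device)

# Measure with benchmark off first, since autotuning results are cached.
for flag in (False, True):
    allocated, reserved = peak_memory_mib(model, batch, flag)
    print(f"benchmark={flag}: allocated {allocated:.1f} MiB, "
          f"reserved {reserved:.1f} MiB")
```

For cleaner numbers it is probably better to run each setting in a fresh process, because the first benchmarked call includes the algorithm search itself.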