Setting torch.backends.cudnn.benchmark = True consumes a huge amount of memory

I am training a progressive GAN (ProGAN) model with torch.backends.cudnn.benchmark = True. ProGAN progressively adds layers to the model during training to handle higher-resolution images.

I notice that at the beginning of training the GPU memory consumption fluctuates a lot. Sometimes it exceeds 48 GB and leads to CUDNN_STATUS_INTERNAL_ERROR. However, after this period of fluctuation, memory consumption returns to normal, which is only around 3~4 GB. I suspect the fluctuation comes from cuDNN auto-tuning the convolution algorithm, so I set torch.backends.cudnn.benchmark = False, and then everything runs smoothly.

My questions are: (1) Should I use cudnn.benchmark when the network, i.e. ProGAN, changes its structure after a few epochs, e.g. 50 epochs? (2) Is the excessive memory consumption caused by the auto-tuning, or is it a bug in my code?

  1. Yes, you could use it for better performance especially if you are dealing with static input shapes (or a few different ones). Each new input shape (and conv setup) will rerun the heuristics and will thus slow down your code. Afterwards this setup will be cached.

  2. Yes, it’s coming from running different algorithms internally, which might have different memory requirements.
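One way to check both answers on your own hardware is to compare the peak memory of a single convolution with benchmarking off and on. The following is only a rough sketch: the helper name `peak_conv_memory_mib` and the input shape are made up for illustration, and it assumes that the cuDNN workspace shows up in PyTorch's allocator statistics. It returns None when PyTorch or a CUDA device is not available.

```python
try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None


def peak_conv_memory_mib(benchmark, shape=(8, 64, 128, 128)):
    """Peak allocated MiB for one conv forward pass, or None without CUDA.

    `shape` is an arbitrary (N, C, H, W) example, not a value from the thread.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.backends.cudnn.benchmark = benchmark
    conv = torch.nn.Conv2d(shape[1], shape[1], kernel_size=3, padding=1).cuda()
    x = torch.randn(*shape, device="cuda")
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        conv(x)  # with benchmark=True, the first call for a new shape triggers tuning
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20


for flag in (False, True):
    print(f"benchmark={flag}: peak MiB = {peak_conv_memory_mib(flag)}")
```

A large gap between the two numbers on the first call for a given shape would point to the auto-tuning rather than to a bug in the training code.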

Thanks for your reply! I am a bit surprised that the benchmark consumes so much memory. I have a 48 GB GPU and still got an OOM, while training the same network only costs 3~4 GB. What else could I do to lower the memory consumption of the benchmark? Currently I cannot turn it on without an error. Thanks again.

You should not run into an OOM, and yes, some algorithms might try to use a huge workspace if they need to, e.g., create copies to permute and realign data.
As a workaround, you could set the CUDNN_CONV_WSCAP_DBG environment variable to a lower value (in MiB) to skip algorithms requesting a larger workspace (description of this env var, proposal to expose this mechanism).
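A minimal sketch of applying this workaround from inside a Python script follows. The helper name `cap_cudnn_workspace` and the 4096 MiB cap are illustrative assumptions, not recommended values. To be safe, the variable is set before torch is imported, so that it is in place before cuDNN is first used.

```python
import os


def cap_cudnn_workspace(limit_mib):
    """Set CUDNN_CONV_WSCAP_DBG (a value in MiB) for the current process."""
    if limit_mib <= 0:
        raise ValueError("limit_mib must be a positive number of MiB")
    os.environ["CUDNN_CONV_WSCAP_DBG"] = str(limit_mib)
    return os.environ["CUDNN_CONV_WSCAP_DBG"]


# Hypothetical cap of 4096 MiB; tune it to your GPU and model.
cap_cudnn_workspace(4096)

# Only afterwards import torch and enable benchmarking:
# import torch
# torch.backends.cudnn.benchmark = True
```

The same effect can be had by exporting the variable in the shell before launching the training script, which avoids any dependence on import order.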
