I am training a progressive GAN (ProGAN) model with torch.backends.cudnn.benchmark = True. ProGAN progressively adds layers to the model during training to handle higher-resolution images.
I notice that at the beginning of training the GPU memory consumption fluctuates a lot; sometimes it exceeds 48 GB and leads to a CUDNN_STATUS_INTERNAL_ERROR. After this initial period of fluctuation, however, memory consumption returns to normal, around 3-4 GB. I suspect the fluctuation comes from cuDNN's auto-tuning of the convolution algorithms, so I set torch.backends.cudnn.benchmark = False, and then everything runs smoothly.
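For context, here is a minimal sketch of what I think is happening (the single conv layer, batch size, and resolution schedule are made up for illustration, not my real model): with benchmark mode on, every time the input shape changes (which happens at every ProGAN stage), cuDNN re-runs its algorithm search, and I suspect that search is where the memory spikes come from.

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # the setting in question

# Stand-in for the grown discriminator: a single conv is enough to trigger
# cuDNN algorithm selection for each new input shape.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()

for res in [4, 8, 16, 32, 64, 128, 256, 512, 1024]:  # ProGAN-style resolution growth
    x = torch.randn(16, 3, res, res, device="cuda")
    y = conv(x)          # first call at each new shape re-runs the benchmark search
    y.mean().backward()  # backward algorithms are benchmarked separately
```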
My questions are: (1) Should I use cudnn.benchmark when the network (i.e., ProGAN) changes its structure every few epochs, e.g., every 50 epochs? (2) Is the excessive memory consumption caused by the auto-tuning, or is it a bug in my code?