CUDA memory builds up linearly with batches on privacy engine

I am training on MNIST images with a ResNet-18 architecture on a machine with 16 GB of GPU memory. Everything works perfectly if I do not attach the privacy engine to my optimizer: each batch takes about 1-2 GB of GPU memory, which is freed after the batch is processed, so total GPU memory consumption stays around 2 GB throughout training.

However, when I attach the privacy engine to my optimizer, the used GPU memory keeps growing with each batch, and soon I receive a CUDA out-of-memory error.

I do not understand why GPU memory is not freed at the end of every batch once the privacy engine is attached.

Do you observe the memory leak during training or testing? Also, are you running the latest version of Opacus?

I observe this during training, and I am using Opacus version 0.13.0.

I see. Leaks can happen if you do more forward passes than backward passes, as the activations do not get deleted. Is that the case for you? Also, could you share a minimal code sample that reproduces it?
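
To illustrate the mechanism described above, here is a rough sketch in plain PyTorch. It is not Opacus's actual code: `attach_activation_hooks` and `train_step` are hypothetical names, and the hooks only imitate the assumed pattern of saving an activation on every forward pass and releasing one on every backward pass. Without such hooks, the graph of an unused forward pass is freed as soon as its output is discarded, which may be why the same extra forward pass is harmless in ordinary training.

```python
import torch
import torch.nn as nn

def attach_activation_hooks(layer):
    """Hypothetical helper: save inputs on forward, release one on backward."""
    layer.activations = []

    def forward_hook(module, inputs, output):
        # Every forward pass stores a detached copy of the layer input.
        module.activations.append(inputs[0].detach())

    def backward_hook(module, grad_input, grad_output):
        # Every backward pass releases exactly one stored activation.
        module.activations.pop()

    layer.register_forward_hook(forward_hook)
    layer.register_full_backward_hook(backward_hook)

torch.manual_seed(0)
layer = nn.Linear(4, 2)
attach_activation_hooks(layer)
x = torch.randn(8, 4, requires_grad=True)

def train_step(extra_forward):
    if extra_forward:
        layer(x)  # e.g. a prediction made for logging, never backpropagated
    loss = layer(x).sum()
    loss.backward()
    return len(layer.activations)  # activations still held after the step

balanced = [train_step(extra_forward=False) for _ in range(3)]
leaky = [train_step(extra_forward=True) for _ in range(3)]
print(balanced)  # [0, 0, 0]
print(leaky)     # [1, 2, 3]
```

In this sketch the number of retained tensors grows by one per step whenever a forward pass has no matching backward pass, which matches the linear growth reported above. If the extra forward pass is only needed for evaluation or logging, running it on the unwrapped model or with the hooks disabled is one possible way to avoid it.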

Thanks! I think that was the issue: there was an extra forward pass. However, I am not sure why this does not cause any issues when I train without the privacy engine.

I was facing a similar problem when I was using GradSampleModule. The problem was that I needed to call model.zero_grad() instead of optim.zero_grad() to clear the accumulated grad_sample attributes.
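
For anyone who wants to check whether they are hitting the same thing, here is a small diagnostic sketch. It assumes per-sample gradients are stored in a `grad_sample` attribute on each parameter, as mentioned above. `grad_sample_bytes` and `clear_grad_sample` are hypothetical helpers, not part of Opacus, and the zero tensors are only a stand-in for what a wrapped model would attach after a backward pass.

```python
import torch
import torch.nn as nn

def grad_sample_bytes(model):
    """Hypothetical helper: bytes held in `grad_sample` attributes."""
    total = 0
    for p in model.parameters():
        gs = getattr(p, "grad_sample", None)
        if gs is None:
            continue
        # Accept either a single tensor or a list of tensors.
        tensors = gs if isinstance(gs, (list, tuple)) else [gs]
        total += sum(t.numel() * t.element_size() for t in tensors)
    return total

def clear_grad_sample(model):
    """Hypothetical helper: drop the per-sample gradients explicitly."""
    for p in model.parameters():
        p.grad_sample = None

model = nn.Linear(4, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# Stand-in for per-sample gradients of a batch of 8 (assumption, see above).
for p in model.parameters():
    p.grad_sample = torch.zeros(8, *p.shape)

before = grad_sample_bytes(model)
optim.zero_grad()              # a plain optimizer only resets p.grad
after_optim = grad_sample_bytes(model)
clear_grad_sample(model)
after_clear = grad_sample_bytes(model)
print(before, after_optim, after_clear)
```

Logging `grad_sample_bytes(model)` once per batch in a real training loop should show whether these attributes are what keeps growing. A plain PyTorch optimizer knows nothing about `grad_sample`, so its `zero_grad()` leaves the attribute in place.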