CUDA memory builds up linearly with batches on privacy engine

divyat09 · June 16, 2021, 9:03am

I am training a ResNet-18 on MNIST images on a machine with a 16 GB GPU. Everything works as expected when the privacy engine is not attached to my optimizer: each batch takes about 1-2 GB of GPU memory, which is released after the batch is processed, so total GPU memory consumption stays around 2 GB throughout training.

However, when I attach the privacy engine to my optimizer, the used GPU memory grows with every batch, and I soon get a CUDA out-of-memory error.

I do not understand why GPU memory is no longer freed at the end of each batch once the privacy engine is attached.

alexandresablayrolle · June 16, 2021, 1:44pm

Do you observe the memory leak during training or during testing? Also, are you running the latest version of Opacus?

divyat09 · June 16, 2021, 1:58pm

I observe this during training, and I am using Opacus version 0.13.0.

alexandresablayrolle · June 16, 2021, 2:15pm

I see. Leaks can happen if you do more forward passes than backward passes, as the activations do not get deleted. Is that the case for you? Also, is it possible to get a minimal code sample that reproduces the issue?
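
A minimal sketch of how unbalanced passes can leak, assuming the privacy engine saves each layer's input in a per-layer list on the forward pass and releases one entry per backward pass. The hooks below are a hypothetical stand-in written in plain PyTorch, not the Opacus implementation; `attach_toy_hooks` and the `activations` list are names made up for illustration.

```python
import torch
import torch.nn as nn


def attach_toy_hooks(layer):
    """Toy stand-in for per-sample-gradient hooks (NOT the Opacus code)."""
    layer.activations = []

    def forward_hook(module, inputs, output):
        # Save the layer input on every forward pass.
        module.activations.append(inputs[0].detach())

    def backward_hook(module, grad_input, grad_output):
        # Release one saved input per backward pass.
        module.activations.pop()

    layer.register_forward_hook(forward_hook)
    layer.register_full_backward_hook(backward_hook)


layer = nn.Linear(4, 2)
attach_toy_hooks(layer)
x = torch.randn(8, 4)

# Balanced: one forward and one backward per step.
for _ in range(3):
    layer(x).sum().backward()
balanced = len(layer.activations)

# Unbalanced: an extra forward pass per step that is never backpropagated.
for _ in range(3):
    layer(x)                   # extra forward, its saved input is never released
    layer(x).sum().backward()
unbalanced = len(layer.activations)

print(balanced, unbalanced)
```

In this toy setup the list of saved inputs stays empty in the balanced loop and grows by one entry per step in the unbalanced loop. Without such hooks, the output of an extra forward pass is freed as soon as nothing references it, which may be why the same loop shows no growth when the privacy engine is not attached.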

divyat09 · June 16, 2021, 2:47pm

Thanks! I think that was the issue: there was an extra forward pass. However, I am not sure why this does not cause any issues when I train without the privacy engine.

RevoGen · December 14, 2021, 8:59pm

I was facing a similar problem when using GradSampleModule. I needed to call model.zero_grad() instead of optim.zero_grad() to clear the accumulated grad_sample attributes.
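
A rough illustration of why the choice of zero_grad() call can matter, written in plain PyTorch so it runs without Opacus. It assumes per-sample gradients live in a `grad_sample` attribute on each parameter, as the post describes; the tensors here are filled in by hand, and `clear_grad_sample` is a hypothetical helper, not part of any library.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# Fake the state left behind by one step: a regular gradient plus a
# per-sample gradient (batch of 8) stored under the name `grad_sample`.
for p in model.parameters():
    p.grad = torch.zeros_like(p)
    p.grad_sample = torch.zeros(8, *p.shape)

# A plain torch.optim optimizer only resets `.grad`.
optim.zero_grad()
still_held = [getattr(p, "grad_sample", None) is not None
              for p in model.parameters()]
print(still_held)


def clear_grad_sample(module):
    """Hypothetical helper: drop the per-sample gradients explicitly."""
    for p in module.parameters():
        p.grad_sample = None


clear_grad_sample(model)
released = [p.grad_sample is None for p in model.parameters()]
print(released)
```

The per-sample tensors are a factor of the batch size larger than the regular gradients, so leaving them attached between steps can hold on to a noticeable amount of memory.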