OOM issue: how to manage GPU memory?

Hello:) I’m curious about how PyTorch handles GPU allocation in terms of reserved, free, and allocated memory.

While trying to train a Seq2Seq image generation model on a single RTX 3070 (8 GB), I run into an OOM error whenever the mini-batch size is over 2.

The picture shows that the allocated, free, and reserved memory are not linearly related to the batch size, and (not shown in the picture) neither to the size of the model (number of hidden layers) nor to the size of the input image (225x225 or 160x160).

Is there any clear relationship between the (reserved, free, allocated) GPU memory and the model (model size, batch size, …)?
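For context, this is roughly how I’m reading those numbers (a minimal sketch using the default CUDA device; the values in my picture come from these calls):

```python
import torch

device = torch.device("cuda")

# Bytes currently occupied by tensors vs. bytes the caching allocator has reserved
allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)

# Free and total device memory as reported by the driver (includes other processes)
free, total = torch.cuda.mem_get_info(device)

print(f"allocated: {allocated / 1024**2:.1f} MiB")
print(f"reserved:  {reserved / 1024**2:.1f} MiB")
print(f"free/total (driver): {free / 1024**2:.1f} / {total / 1024**2:.1f} MiB")

# Detailed breakdown of the caching allocator's state
print(torch.cuda.memory_summary(device))
```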

I think just reducing the mini-batch size is not the smartest answer to an OOM issue.

The error messages might point to different allocation failures, since increasing the batch size might also change the point of failure.
E.g., batch_size=4 might fail in the backward pass when the gradient of layer17 is calculated, which would need 38 MiB, while batch_size=32 might fail in the forward pass when layer3 tries to allocate 296 MiB for the forward activation.

Reducing the batch size or the model size is one possible way to avoid the OOM; alternatively, torch.utils.checkpoint can be used to trade compute for memory.
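A minimal sketch of how torch.utils.checkpoint could be applied (the SubBlock module, dimensions, and number of blocks here are placeholders, not your actual Seq2Seq model):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SubBlock(nn.Module):
    # Placeholder standing in for a memory-heavy part of the model
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class CheckpointedModel(nn.Module):
    def __init__(self, dim=256, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(SubBlock(dim) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not stored during the forward pass;
            # they are recomputed in backward, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedModel().cuda()
out = model(torch.randn(8, 256, device="cuda", requires_grad=True))
out.mean().backward()
```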

Thanks for your advice on using torch.utils.checkpoint.
Let me see what I can do with it and share the progress soon:)

It’s been a while:) I handled the OOM issue by using gradient accumulation with a small batch size! Thanks again!
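In case it helps someone else, this is roughly the pattern I ended up with (a minimal sketch; the toy model, data, batch size of 2, and accum_steps=4 are placeholders for the real training setup):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real Seq2Seq model and dataset
model = nn.Linear(64, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 64), torch.randint(0, 10, (64,))),
                    batch_size=2)

accum_steps = 4  # effective batch size = 2 * 4 = 8, with the memory cost of batch size 2
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    loss = criterion(model(x), y)
    # Scale the loss so the accumulated gradient matches one large-batch step
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```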

Thank you, @hw-kim-ivan and @ptrblck.

This post has a nice explanation and code example. The code is written in Keras/TF though.