GPU memory allocation with inference_mode(True)

I have trained a Unet-esque model on 512x512 images. Training and evaluation perform without error.

However, I would like to evaluate the model on a much larger image, ~15,000x40,000.

I have 24 GB of total GPU memory. After loading the model’s state dictionary and sending the model to the GPU, the GPU memory usage is

torch.cuda.memory_reserved(0)/1e9
>> 0.24

After sending the “big image” to the GPU, the total GPU usage is ~7.5 GB, which is expected for a 3-channel float32 image of this size.
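As a rough sanity check on that figure (assuming the image is about 15,000x40,000 with 3 channels, as above):

# height * width * channels * 4 bytes per float32 element
15_000 * 40_000 * 3 * 4 / 1e9
>> 7.2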

Then, I try to evaluate the model as follows:

from torchvision import transforms

with torch.inference_mode(True):
    imt = transforms.ToTensor()(im)[None]  # convert the image to a float tensor and add a batch dim
    imc = imt.to(device)
    out = model(imc)

This throws the following OOM error:

RuntimeError: CUDA out of memory. Tried to allocate 148.56 GiB (GPU 0; 23.68 GiB total capacity; 7.07 GiB already allocated; 14.36 GiB free; 7.08 GiB reserved in total by PyTorch)

My question is, why is so much memory being allocated while within inference_mode?

The forward pass would calculate the forward activations, which would also need to be stored temporarily, so the OOM issue could be expected. What was the peak memory usage using the 512x512 images?
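If it helps, the peak usage can be read from the built-in CUDA memory stats, e.g. (a minimal sketch assuming a single GPU and that model, device, and a 512x512 batch tensor are already defined):

import torch

torch.cuda.reset_peak_memory_stats(device)
with torch.inference_mode():
    out = model(batch.to(device))

print(torch.cuda.max_memory_allocated(device) / 1e9)  # peak GB allocated by tensors
print(torch.cuda.max_memory_reserved(device) / 1e9)   # peak GB reserved by the caching allocator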

Ah yes, that makes sense. Is there a way for the activation tensors to be overwritten or utilized ‘in place’ (as I write this, I am realizing how implausible this is)? While training on the 512x512 images, the peak memory usage was ~6.1 GB.
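The only related knob I can think of is marking pointwise activations as in-place, e.g. nn.ReLU(inplace=True), which overwrites its input instead of allocating a new tensor, but that would only trim the pointwise buffers (a toy block, not my actual model):

import torch.nn as nn

# inplace=True reuses the input tensor's memory for the ReLU output;
# safe here because the pre-activation values are not needed again
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)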

I’m not sure you could save more memory than is already saved by the no_grad() (or inference_mode()) wrapper. These context managers make sure forward activations are not stored, since they would only be needed to calculate the gradients, so each activation should be freed right after its last use. However, since these activations still have to be computed, you might still run out of memory.
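To see this effect, you could compare the peak allocated memory of the same forward pass with and without the context manager, e.g. (a small sketch with a made-up conv stack, not your model):

import torch
import torch.nn as nn

device = torch.device("cuda")
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
).to(device)
x = torch.randn(1, 3, 1024, 1024, device=device)

def peak_forward(use_inference_mode):
    torch.cuda.reset_peak_memory_stats(device)
    if use_inference_mode:
        with torch.inference_mode():
            net(x)  # intermediates are freed as soon as the next layer is done with them
    else:
        net(x)  # intermediates are kept alive for a potential backward pass
    return torch.cuda.max_memory_allocated(device) / 1e9

print("grad enabled  :", peak_forward(False), "GB")
print("inference_mode:", peak_forward(True), "GB")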