Debugging CUDA out of memory (16GB GPU, 14+ GB reserved)

I’m pretty new and want to learn how to debug GPU memory allocation.

My setup:

  • Paperspace machine with an A4000 16 GB GPU
  • single notebook running
  • playing with DINOv2, just using the embedding part with pre-trained weights
  • inspecting the model, it has ~427M params, so even in float32 the weights should only be around 1.7GB
  • loading 280x280 images that I want to embed; 100 images x 280x280x3 in float32 should be under 100MB (rough math below)
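For reference, this is the back-of-envelope math behind those two numbers (float32 sizes only, ignoring any runtime overhead):

params = 427_000_000            # reported parameter count
print(params * 4 / 1024**3)     # ~1.6 GiB for the weights in float32

inputs = 100 * 3 * 280 * 280    # full input tensor, all 100 images
print(inputs * 4 / 1024**2)     # ~90 MiB in float32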

I’m still getting RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 15.73 GiB total capacity; 13.80 GiB already allocated; 23.12 MiB free; 14.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
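If I understand the error message correctly, that setting goes into an environment variable before CUDA is first touched, something like this (the 128 value is just a guess on my part):

import os
# must be set before the first CUDA allocation in the process
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

But I’d rather understand where the memory is going than tune the allocator blindly.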

Here is code to reproduce:

import torch
from torchvision import transforms
from PIL import Image

m = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
m.eval()
m.cuda()

pil2tensor = transforms.ToTensor()
files = []  # 100 file paths to png files
x = torch.stack([pil2tensor(Image.open(f).resize((14 * 20, 14 * 20))) for f in files])
assert x.shape == (100, 3, 280, 280)

bs = 10 # tried many different batch sizes
x_embs = []
for i in range(0, len(files), bs):
    batch = x[i:i+bs]
    batch = batch.cuda()
    x_emb = m(batch)
    x_embs.append(x_emb.cpu())

It usually fails on the x_emb = m(batch) line, i.e. during the model forward pass. I tried different batch sizes and calling torch.cuda.empty_cache() in various places, but nothing helps.

It works fine on CPU.

Any advice on how to figure out why so much memory is “reserved”?
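Is there a recommended way to inspect this? I assume it’s something along these lines, but I’m not sure what to look for in the output:

print(torch.cuda.memory_allocated() / 1024**2, "MiB currently allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed breakdown of allocator statistics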

Besides the parameters and inputs, intermediate forward activations can allocate a lot of memory, as explained e.g. in this post.
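Since you only need embeddings, disabling gradient tracking around the forward pass should stop those activations from being kept alive for a backward pass that never happens. A minimal sketch of your loop (torch.no_grad() works the same way on older PyTorch versions):

x_embs = []
with torch.inference_mode():  # no autograd graph, so activations are freed as soon as possible
    for i in range(0, len(x), bs):
        batch = x[i:i + bs].cuda()
        x_embs.append(m(batch).cpu())
x_embs = torch.cat(x_embs)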
