Debugging CUDA out of memory (16GB GPU, 14+ GB reserved)

I’m pretty new and want to learn how to debug GPU memory allocation.

My setup:

  • Paperspace machine with an A4000 16 GB GPU
  • single notebook running
  • playing with DINOv2, just using the embedding part with pre-trained weights
  • inspecting the model, it has ~427M params, so even in float32 the weights should only be around 1.7GB
  • loading 280x280 images that I want to embed; 100 images x 280x280x3 in float32 should be under 100MB (rough math below)
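For reference, this is the back-of-envelope math behind those two numbers (float32 sizes only, ignoring any runtime overhead):

params = 427_000_000            # reported parameter count
print(params * 4 / 1024**3)     # ~1.6 GiB for the weights in float32

inputs = 100 * 3 * 280 * 280    # full input tensor, all 100 images
print(inputs * 4 / 1024**2)     # ~90 MiB in float32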

I’m still getting RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 15.73 GiB total capacity; 13.80 GiB already allocated; 23.12 MiB free; 14.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
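If I understand the error message correctly, that setting goes into an environment variable before CUDA is first touched, something like this (the 128 value is just a guess on my part):

import os
# must be set before the first CUDA allocation in the process
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

But I’d rather understand where the memory is going than tune the allocator blindly.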

Here is code to reproduce:

import torch
from torchvision import transforms
from PIL import Image

m = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
m.eval()
m.cuda()

pil2tensor = transforms.ToTensor()
files = []  # 100 file paths to png files
x = torch.stack([pil2tensor(Image.open(f).resize((14 * 20, 14 * 20))) for f in files])
assert x.shape == (100, 3, 280, 280)

bs = 10 # tried many different batch sizes
x_embs = []
for i in range(0, len(files), bs):
    batch = x[i:i+bs]
    batch = batch.cuda()
    x_emb = m(batch)
    x_embs.append(x_emb.cpu())

It usually fails on the x_emb = m(batch) line, i.e. during the model forward pass. I tried different batch sizes and calling torch.cuda.empty_cache() in various places, but nothing helps.

It works fine on CPU.

Any advice on how to figure out why so much memory is “reserved”?
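Is there a recommended way to inspect this? I assume it’s something along these lines, but I’m not sure what to look for in the output:

print(torch.cuda.memory_allocated() / 1024**2, "MiB currently allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed breakdown of allocator statistics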

Besides the parameters and inputs, intermediate forward activations can allocate a lot of memory, as explained e.g. in this post.
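Since you only need embeddings, disabling gradient tracking around the forward pass should stop those activations from being kept alive for a backward pass that never happens. A minimal sketch of your loop (torch.no_grad() works the same way on older PyTorch versions):

x_embs = []
with torch.inference_mode():  # no autograd graph, so activations are freed as soon as possible
    for i in range(0, len(x), bs):
        batch = x[i:i + bs].cuda()
        x_embs.append(m(batch).cpu())
x_embs = torch.cat(x_embs)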
