torch.cuda.OutOfMemoryError when training Mask R-CNN

Hello! I am running into a memory issue that I don’t think I should be having with 24 GB of VRAM. I am trying to train Mask R-CNN and find that directly after the first forward pass the allocated memory jumps to ~20 GB, which I don’t expect since the model is ~200 MB and the batch is ~300 MB. Then, when I call loss.backward(), the allocated memory exceeds the capacity of the GPU and the code crashes. See the code snippet below. What am I doing wrong here?

I have tested a few different settings for PYTORCH_CUDA_ALLOC_CONF; they did help a bit, but they didn’t solve the problem completely.
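Nothing fancy, just values along the lines of the max_split_size_mb hint from the error message below, e.g.:

import os
# has to be set before the first CUDA allocation in the process; the value is only an example
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"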

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

batch_size = 36
img_size = 800
objects_per_image = 10

device = torch.device("cuda") 

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)

model = model.to(device)

print('Memory allocated: ', torch.cuda.max_memory_allocated() * 1e-9, 'GB,', 'memory reserved: ', torch.cuda.max_memory_reserved()*1e-9, 'GB')
# Memory allocated:  0.181971456 GB, memory reserved:  0.20132659200000003 GB

x = [torch.rand((3, img_size, img_size), device=device)] * batch_size  # the same tensor repeated, standing in for a batch
y = [{'boxes': torch.tensor([1, 2, 3, 4], device=device, dtype=torch.int64).repeat(objects_per_image, 1),
      'labels': torch.ones((objects_per_image,), device=device, dtype=torch.int64),
      'masks': torch.randint(2, size=(objects_per_image, img_size, img_size), dtype=torch.int64)  # one binary mask per object
      }] * batch_size

print('Memory usage of objects in batch (GB):')
print('\timgs: ', x[0].element_size() * x[0].nelement() * batch_size * 1e-9)
print('\tboxes: ', y[0]['boxes'].element_size() * y[0]['boxes'].nelement() * batch_size * 1e-9)
print('\tlabels: ', y[0]['labels'].element_size() * y[0]['labels'].nelement() * batch_size * 1e-9)
print('\tmasks: ', y[0]['masks'].element_size() * y[0]['masks'].nelement() * batch_size * 1e-9)
# Memory usage of objects in batch (GB):
#        imgs:  0.27648
#        boxes:  1.152e-05
#        labels:  2.8800000000000004e-06
#        masks:  1.8432000000000002

print('Memory allocated: ', torch.cuda.max_memory_allocated() * 1e-9, 'GB,', 'memory reserved: ', torch.cuda.max_memory_reserved()*1e-9, 'GB')
# Memory allocated:  0.189653504 GB, memory reserved:  0.20132659200000003 GB

model.train()
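# in training mode, torchvision detection models return a dict of losses instead of detections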
output = model(x, y)

print('Memory allocated: ', torch.cuda.max_memory_allocated() * 1e-9, 'GB,', 'memory reserved: ', torch.cuda.max_memory_reserved()*1e-9, 'GB')
# Memory allocated:  21.554223616 GB, memory reserved:  22.085107712000003 GB

torch.cuda.empty_cache()
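# NOTE: empty_cache() only returns unused cached blocks to the driver; it cannot free memory held
# by live tensors (the activations kept for backward), and the max_memory_* stats report peaks,
# so the prints below don't change.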

print('Memory allocated: ', torch.cuda.max_memory_allocated() * 1e-9, 'GB,', 'memory reserved: ', torch.cuda.max_memory_reserved()*1e-9, 'GB')
# Memory allocated:  21.554223616 GB, memory reserved:  22.085107712000003 GB

loss = sum(l for l in output.values())  # sum of the individual losses (rpn, box reg, classifier, mask) for a single backward
loss.backward()
# torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 0; 23.62 GiB 
# total capacity; 20.45 GiB already allocated; 869.31 MiB free; 20.98 GiB reserved in total by PyTorch) If
# reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See 
# documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For reference, here is the nvidia-smi output from right before running the code; as you can see, the GPU is largely unoccupied.

The model parameters and inputs can be tiny compared to the intermediate activations that have to be stored for the gradient computation.
Have a look at this post showing an example.
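
As a rough, minimal sketch (a toy conv stack, not Mask R-CNN) you can see the effect directly: the parameters take roughly 1.5 MB, while the activations kept for backward after a single forward pass are on the order of a gigabyte:

import torch
import torch.nn as nn

device = torch.device("cuda")

# ten small conv layers: only ~1.5 MB of parameters in total
model = nn.Sequential(*[nn.Conv2d(64, 64, 3, padding=1) for _ in range(10)]).to(device)
param_bytes = sum(p.element_size() * p.nelement() for p in model.parameters())
print('parameters: ', param_bytes * 1e-9, 'GB')

x = torch.rand(2, 64, 512, 512, device=device)  # one ~0.13 GB input
out = model(x)  # each conv saves its input for backward -> roughly 10 x 0.13 GB stays allocated
print('allocated after forward: ', torch.cuda.memory_allocated() * 1e-9, 'GB')

Mask R-CNN keeps activations like this for every backbone/FPN/RPN/RoI-head layer and for all 36 images in your batch, which is where the ~20 GB after the forward pass comes from.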

That makes sense - thanks a lot! It seems that for my use case (instance segmentation on 2x1392x1040 images) the memory requirements are quite high. MaskFormer, for instance, requires 10 GB of allocated memory when I have a single image per batch. Do you have any recommendations for decreasing memory requirements, aside from decreasing the batch size?

You could decrease the spatial dimensions of your inputs as they seem to be quite large and will create large intermediates.
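
For instance, here is a minimal sketch assuming you let the model's built-in transform do the downscaling: min_size and max_size are forwarded to the GeneralizedRCNNTransform inside the model (defaults are 800/1333), and smaller values shrink every feature map and saved activation. The values below are only placeholders to illustrate the idea.

from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

# placeholder resize targets; tune them for your own accuracy/memory trade-off
model = maskrcnn_resnet50_fpn(
    weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT,
    min_size=600,   # default 800: target size for the shorter image side
    max_size=800,   # default 1333: cap for the longer image side
)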