OutOfMemory error while training MaskRCNN for a few epochs

I’m trying to train a maskrcnn_resnet50_fpn for object detection.

I’ve created my data loader, with 256x256 colour images as the input, and targets containing boxes and their corresponding labels as the output.

# BATCH_SIZE = 1
images, targets = next(iter(test_dataloader))
print(f"Feature shape: {images[0].shape}")
print("Feature")
print(images[0])

print("Targets")
print(targets[0])
Feature shape: torch.Size([3, 256, 256])
Feature
tensor([[[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.8863, 0.8863, 0.9137,  ..., 0.9451, 0.9451, 0.9451],
         [0.9020, 0.8941, 0.8902,  ..., 0.9490, 0.9490, 0.9490],
         [0.8706, 0.8667, 0.8667,  ..., 0.9490, 0.9490, 0.9490]],

        [[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.8745, 0.8745, 0.9020,  ..., 0.9451, 0.9451, 0.9451],
         [0.8863, 0.8824, 0.8745,  ..., 0.9490, 0.9490, 0.9490],
         [0.8549, 0.8510, 0.8510,  ..., 0.9490, 0.9490, 0.9490]],

        [[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.9137, 0.9137, 0.9412,  ..., 0.9451, 0.9451, 0.9451],
         [0.9412, 0.9333, 0.9294,  ..., 0.9490, 0.9490, 0.9490],
         [0.9137, 0.9098, 0.9098,  ..., 0.9490, 0.9490, 0.9490]]])
Targets
{'boxes': tensor([[142.8750, 135.8750, 176.1250, 169.1250],
        [130.8750, 116.8750, 164.1250, 150.1250],
        [ 91.1250, 180.1250, 113.8750, 202.8750],
        [131.1250, 145.1250, 153.8750, 167.8750],
        [ 65.1250, 103.1250,  87.8750, 125.8750],
        [ -3.8750,  91.1250,  18.8750, 113.8750],
        [208.1250,  21.1250, 230.8750,  43.8750]]), 'labels': tensor([3, 3, 4, 4, 4, 4, 4])}

This is how I create the model:

import torch.nn as nn
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
# replace the classification and mask output layers so they produce 5 classes
model.roi_heads.box_predictor.cls_score = nn.Linear(in_features=1024, out_features=5, bias=True)
model.roi_heads.mask_predictor.mask_fcn_logits = nn.Conv2d(256, 5, kernel_size=(1, 1), stride=(1, 1))

And this is my train function:

def train(
    model, train_dataloader, test_dataloader, optimizer, epochs, device=device
):
    model.to(device)
    for epoch in tqdm(range(epochs)):
        train_loss = 0
        model.train()
        for batch, (X_train, y_train) in enumerate(train_dataloader):
            X_train = list(image.to(device) for image in X_train)
            y_train = [{k: v.to(device) for k, v in t.items()} for t in y_train]
            loss_dict = model(X_train, y_train)
            losses = sum(loss for loss in loss_dict.values())
            train_loss += losses.detach().item()
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        train_loss /= len(train_dataloader)
        
        test_loss = 0
        for X_test, y_test in test_dataloader:
            model.train()
            X_test = list(image.to(device) for image in X_test)
            y_test = [{k: v.to(device) for k, v in t.items()} for t in y_test]
            test_loss_dict = model(X_test, y_test)
            test_losses = sum(loss for loss in test_loss_dict.values())
            test_loss += test_losses.detach().item()
        
        print(f"\nEpoch: {epoch + 1}")
        print(f"Loss: {train_loss:.5f} | Test loss: {test_loss:.5f}")

This code trains the model and the losses decrease for a while, but around the fifth epoch I get this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 79.10 GiB total capacity; 76.26 GiB already allocated; 96.88 MiB free; 77.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I’m already calling torch.cuda.empty_cache() and I’ve already set the batch size to 1, so I’m not sure what else I can do to make this issue go away. My hardware is also pretty good, definitely good enough that I should be able to train for more than five epochs. Also, following this post, I’m already detaching the losses, but that hasn’t changed anything.

Any idea what might be happening or how to fix it?

I don’t see anything obviously suspicious in the training loop in terms of memory usage, but you might want to use model.eval() and with torch.no_grad() in your test loop: otherwise you would be updating your model’s normalization statistics and using additional memory to store the intermediate activations during your test/validation step.
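
As a minimal sketch of the torch.no_grad() part, reusing the names from your loop (note that torchvision’s detection models only return the loss dict in train() mode, so the model is kept in train mode here; calling model.eval() would give you predictions rather than losses):

test_loss = 0
model.train()
with torch.no_grad():  # no graph is built, so intermediate activations are not kept around
    for X_test, y_test in test_dataloader:
        X_test = [image.to(device) for image in X_test]
        y_test = [{k: v.to(device) for k, v in t.items()} for t in y_test]
        test_loss_dict = model(X_test, y_test)
        test_loss += sum(loss for loss in test_loss_dict.values()).item()
test_loss /= len(test_dataloader)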

However, while there would be additional memory usage without torch.no_grad(), that alone doesn’t really explain why the OOM only happens after five epochs.
Are you seeing growing memory usage during training, e.g. via torch.cuda.memory_stats?
If you do, a simple debugging approach would be to ablate the training loop, removing operations until you no longer see increasing memory usage. If the responsible operation(s) are unexpected, then it could be a real bug in the framework.
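
For example (a hypothetical helper, not something from your code), you could log the allocated and peak memory at the end of every epoch and check whether it keeps climbing:

import torch

def log_cuda_memory(epoch):
    allocated = torch.cuda.memory_allocated() / 1024**2   # MiB currently held by tensors
    peak = torch.cuda.max_memory_allocated() / 1024**2    # peak MiB since the last reset
    print(f"epoch {epoch}: allocated {allocated:.0f} MiB, peak {peak:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()                  # make the peak figure per-epoch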

Not sure if this is the exact same issue you’re having here, but I’ll mention it anyway: I’ve had some strange CUDA OOM errors when training very moderate networks (like trying to allocate 1 TB on a GPU) due to mismatches between the shapes of the tensors in the loss computation. For instance, I was trying to calculate the loss on tensors of shapes (N x p) and (N x p x 1), or (N x 1 x p), and so on. Add in some checks to make sure that all your tensors really have the shapes you expect.
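
To illustrate with toy shapes (nothing to do with your actual tensors), a single stray dimension is enough to make the subtraction inside a loss broadcast to a much larger tensor than intended:

import torch

N, p = 4096, 8
pred = torch.randn(N, p)         # expected shape (N, p)
target = torch.randn(N, 1, p)    # accidental extra dimension: (N, 1, p)

diff = pred - target             # silently broadcasts to (N, N, p)
print(diff.shape)                # torch.Size([4096, 4096, 8]), roughly 512 MiB of float32

# A cheap guard before the loss call makes the mistake fail loudly instead:
assert pred.shape == target.shape, f"shape mismatch: {pred.shape} vs {target.shape}"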