CUDA out of memory while training FasterRCNN for object detection?

I’m trying to train a fasterrcnn_resnet50_fpn for object detection.

I’ve created my data loader, with the input being a 224x224 colour image and the output being targets containing bounding boxes and their corresponding labels.

# BATCH_SIZE = 1
images, targets = next(iter(test_dataloader))
print(f"Feature shape: {images[0].shape}")
print("Feature")
print(images[0])

print("Targets")
print(targets[0])
Feature shape: torch.Size([3, 224, 224])
Feature
tensor([[[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.8863, 0.8863, 0.9137,  ..., 0.9451, 0.9451, 0.9451],
         [0.9020, 0.8941, 0.8902,  ..., 0.9490, 0.9490, 0.9490],
         [0.8706, 0.8667, 0.8667,  ..., 0.9490, 0.9490, 0.9490]],

        [[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.8745, 0.8745, 0.9020,  ..., 0.9451, 0.9451, 0.9451],
         [0.8863, 0.8824, 0.8745,  ..., 0.9490, 0.9490, 0.9490],
         [0.8549, 0.8510, 0.8510,  ..., 0.9490, 0.9490, 0.9490]],

        [[0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         [0.9451, 0.9451, 0.9451,  ..., 0.9490, 0.9490, 0.9490],
         ...,
         [0.9137, 0.9137, 0.9412,  ..., 0.9451, 0.9451, 0.9451],
         [0.9412, 0.9333, 0.9294,  ..., 0.9490, 0.9490, 0.9490],
         [0.9137, 0.9098, 0.9098,  ..., 0.9490, 0.9490, 0.9490]]])
Targets
{'boxes': tensor([[142.8750, 135.8750, 176.1250, 169.1250],
        [130.8750, 116.8750, 164.1250, 150.1250],
        [ 91.1250, 180.1250, 113.8750, 202.8750],
        [131.1250, 145.1250, 153.8750, 167.8750],
        [ 65.1250, 103.1250,  87.8750, 125.8750],
        [ -3.8750,  91.1250,  18.8750, 113.8750],
        [208.1250,  21.1250, 230.8750,  43.8750]]), 'labels': tensor([3, 3, 4, 4, 4, 4, 4])}
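
For completeness, the loader itself is wired up roughly like this (an illustrative sketch rather than my exact dataset code; detection_dataset stands in for my custom Dataset, which returns (image, target) pairs):

from torch.utils.data import DataLoader

def collate_fn(batch):
    # keep images and targets as plain lists, which is what torchvision detection models expect
    return tuple(zip(*batch))

# detection_dataset yields (image tensor [3, 224, 224], {"boxes": ..., "labels": ...}) pairs
test_dataloader = DataLoader(detection_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn)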

This is how I create the model:

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.roi_heads.box_predictor.cls_score = nn.Linear(in_features=1024, out_features=5, bias=True)  # four object classes, plus the background class

And this is my train function:

def train(
    model, train_dataloader, test_dataloader, optimizer, epochs, device=device
):
    model.to(device)
    for epoch in tqdm(range(epochs)):
        torch.cuda.empty_cache()
        train_loss = 0
        model.train()
        for batch, (X_train, y_train) in enumerate(train_dataloader):
            if (batch + 1) % 384 == 0:
                print(f"{batch + 1} / {len(train_dataloader)} batches")
            X_train = list(image.to(device) for image in X_train)
            y_train = [{k: v.to(device) for k, v in t.items()} for t in y_train]
            loss_dict = model(X_train, y_train)
            losses = sum(loss for loss in loss_dict.values())
            train_loss += losses
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        train_loss /= len(train_dataloader)
        
        torch.cuda.empty_cache()
        test_loss = 0
        for X_test, y_test in test_dataloader:
            X_test = list(image.to(device) for image in X_test)
            y_test = [{k: v.to(device) for k, v in t.items()} for t in y_test]
            test_loss_dict = model(X_test, y_test)
            test_losses = sum(loss for loss in test_loss_dict.values())
            test_loss += test_losses
        test_loss /= len(test_dataloader)
        
        print(f"\nEpoch: {epoch + 1}")
        print(f"Loss: {train_loss} | Test loss: {test_loss}")

optimizer = torch.optim.Adam(lr=0.001, params=model.parameters())
train(model, train_dataloader, test_dataloader, optimizer, 5)

This code trains the model and the losses decrease for a while, but around the fifth epoch I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 31.75 GiB total capacity; 30.46 GiB already allocated; 1.50 MiB free; 30.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I’m already calling torch.cuda.empty_cache() and I’ve already set the batch size to 1, so I’m not sure what else I can do to make this issue go away. My GPU also has about 32 GiB of memory, which should easily be enough to train for more than 5 epochs.

Any idea what might be happening or how to fix it?

It looks like you are accumulating the losses tensor, which is still attached to the computation graph, in this line of code:

train_loss += losses

Could you detach() this tensor and call item() on it so that you only accumulate the plain floating-point value instead of the tensor, and see if that helps? Accumulating the attached tensor keeps every batch's computation graph alive for the whole epoch, so GPU memory grows until it runs out.
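
For example, a minimal sketch of the inner training loop with only the accumulation line changed:

loss_dict = model(X_train, y_train)
losses = sum(loss for loss in loss_dict.values())
# accumulate a plain Python float so this batch's graph can be freed
train_loss += losses.detach().item()

optimizer.zero_grad()
losses.backward()
optimizer.step()

The test_loss accumulation in your evaluation loop looks like it has the same pattern, so the same change would apply there.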

Ah perfect, that’s worked well. Thanks!