I’m trying to fine-tune a pretrained maskrcnn_resnet50_fpn from torchvision for object detection.
I’ve created my data loaders: each input is a 256x256 colour image, and each target is a dict with bounding boxes and the corresponding labels.
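The DataLoader is set up roughly like this (a simplified sketch; train_dataset / test_dataset stand in for my actual Dataset class, which returns an image tensor plus a {"boxes", "labels"} dict per sample). The important part is the collate_fn that keeps images and targets as lists instead of stacking them:
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Each image can have a different number of boxes, so keep the batch
    # as a tuple of lists instead of stacking into a single tensor.
    return tuple(zip(*batch))

# train_dataset / test_dataset are placeholders for my Dataset instances; they return
# (image_tensor, {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]}) pairs.
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn)
Inspecting one batch gives: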
# BATCH_SIZE = 1
images, targets = next(iter(test_dataloader))
print(f"Feature shape: {images[0].shape}")
print("Feature")
print(images[0])
print("Targets")
print(targets[0])
Feature shape: torch.Size([3, 256, 256])
Feature
tensor([[[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
...,
[0.8863, 0.8863, 0.9137, ..., 0.9451, 0.9451, 0.9451],
[0.9020, 0.8941, 0.8902, ..., 0.9490, 0.9490, 0.9490],
[0.8706, 0.8667, 0.8667, ..., 0.9490, 0.9490, 0.9490]],
[[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
...,
[0.8745, 0.8745, 0.9020, ..., 0.9451, 0.9451, 0.9451],
[0.8863, 0.8824, 0.8745, ..., 0.9490, 0.9490, 0.9490],
[0.8549, 0.8510, 0.8510, ..., 0.9490, 0.9490, 0.9490]],
[[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
[0.9451, 0.9451, 0.9451, ..., 0.9490, 0.9490, 0.9490],
...,
[0.9137, 0.9137, 0.9412, ..., 0.9451, 0.9451, 0.9451],
[0.9412, 0.9333, 0.9294, ..., 0.9490, 0.9490, 0.9490],
[0.9137, 0.9098, 0.9098, ..., 0.9490, 0.9490, 0.9490]]])
Targets
{'boxes': tensor([[142.8750, 135.8750, 176.1250, 169.1250],
[130.8750, 116.8750, 164.1250, 150.1250],
[ 91.1250, 180.1250, 113.8750, 202.8750],
[131.1250, 145.1250, 153.8750, 167.8750],
[ 65.1250, 103.1250, 87.8750, 125.8750],
[ -3.8750, 91.1250, 18.8750, 113.8750],
[208.1250, 21.1250, 230.8750, 43.8750]]), 'labels': tensor([3, 3, 4, 4, 4, 4, 4])}
This is how I create the model:
from torch import nn
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

# Start from the pretrained COCO weights and resize the classification and mask
# output layers to 5 classes (4 object classes + background).
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.roi_heads.box_predictor.cls_score = nn.Linear(in_features=1024, out_features=5, bias=True)
model.roi_heads.mask_predictor.mask_fcn_logits = nn.Conv2d(256, 5, kernel_size=(1, 1), stride=(1, 1))
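For reference, the torchvision fine-tuning tutorial does this head swap with the predictor classes instead, which also replaces the box-regression layer (bbox_pred) rather than only cls_score. A sketch, assuming the same 5 classes and the default 256-channel mask head:
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 5  # 4 object classes + background

# Replace the whole box predictor (cls_score and bbox_pred together)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask predictor
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)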
And this is my train function:
from tqdm.auto import tqdm


def train(
    model, train_dataloader, test_dataloader, optimizer, epochs, device=device
):
    model.to(device)
    for epoch in tqdm(range(epochs)):
        # Training loop
        train_loss = 0
        model.train()
        for batch, (X_train, y_train) in enumerate(train_dataloader):
            X_train = list(image.to(device) for image in X_train)
            y_train = [{k: v.to(device) for k, v in t.items()} for t in y_train]
            loss_dict = model(X_train, y_train)  # in train mode the model returns a dict of losses
            losses = sum(loss for loss in loss_dict.values())
            train_loss += losses.detach().item()

            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        train_loss /= len(train_dataloader)

        # Evaluation loop (kept in train mode so the model still returns losses)
        test_loss = 0
        for X_test, y_test in test_dataloader:
            model.train()
            X_test = list(image.to(device) for image in X_test)
            y_test = [{k: v.to(device) for k, v in t.items()} for t in y_test]
            test_loss_dict = model(X_test, y_test)
            test_losses = sum(loss for loss in test_loss_dict.values())
            test_loss += test_losses.detach().item()
        test_loss /= len(test_dataloader)

        print(f"\nEpoch: {epoch + 1}")
        print(f"Loss: {train_loss:.5f} | Test loss: {test_loss:.5f}")
This trains the model and the losses decrease for a few epochs, but around the fifth epoch I get this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 79.10 GiB total capacity; 76.26 GiB already allocated; 96.88 MiB free; 77.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I’m already calling torch.cuda.empty_cache() and I’ve already set the batch size to 1, so I’m not sure what else I can do to make this go away. The GPU has about 80 GiB of memory, which should be more than enough to train this model for more than 5 epochs. Also, following this post, I’m already detaching the losses before accumulating them, but that hasn’t changed anything.
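The cache clearing is just called between epochs, something like this:
import torch

torch.cuda.empty_cache()  # called at the end of each epoch; only releases cached blocks, not tensors that are still referenced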
Any idea what might be happening or how to fix it?