Runaway Loss on Mask R-CNN finetuning on custom dataset

Udinanon · July 12, 2022, 3:43pm

I’m fine-tuning the Mask R-CNN model on the EgoHands dataset to get hand instance segmentation.
I used the TorchVision Object Detection Finetuning Tutorial (PyTorch Tutorials 1.12.0+cu102 documentation) as a reference point.
I finally created my dataset loader and tried running the model on the dataset. It seems to run correctly, but the loss values are completely off: they start anywhere between 600 and -900 and usually spiral toward negative infinity quickly.
I can’t understand why this is happening. The dataset I’m using and the PennFudan dataset used in the example seem similar, and the dataset class I built should be correct.
I’ve tried changing the optimizer from SGD to AdamW and fiddling with the hyperparameters, but the result is the same, which hints at some fundamental error.
My code is here: Google Colab

ptrblck · July 13, 2022, 4:38am

This sounds concerning, as the “standard” loss functions do not produce a negative loss.
Before changing the hyperparameters, check what your outputs and targets represent and why the loss is negative.
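
One way to act on this advice is to print the value ranges of a single target before it reaches the model. The sketch below is only an illustration: `summarize_target` is a hypothetical helper, not part of torchvision, and it assumes the target is a dictionary with `boxes` in `[x0, y0, x1, y1]` format, `labels`, and `masks`, held as NumPy arrays, lists, or CPU tensors.

```python
import numpy as np

def summarize_target(target):
    """Return basic range statistics for one detection target (hypothetical helper)."""
    masks = np.asarray(target["masks"])
    boxes = np.asarray(target["boxes"], dtype=np.float64)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return {
        # Distinct pixel values found in the masks; binary masks give [0, 1].
        "mask_values": np.unique(masks).tolist(),
        "masks_binary": bool(np.isin(masks, (0, 1)).all()),
        # Every box needs strictly positive width and height.
        "boxes_valid": bool((widths > 0).all() and (heights > 0).all()),
        "labels": np.asarray(target["labels"]).tolist(),
    }

# Toy target: one 4x4 instance mask that uses 255 for the foreground.
toy_mask = np.zeros((1, 4, 4), dtype=np.uint8)
toy_mask[0, 1:3, 1:3] = 255
toy_target = {"boxes": [[1.0, 1.0, 3.0, 3.0]], "labels": [1], "masks": toy_mask}

report = summarize_target(toy_target)
print(report)
```

If `masks_binary` comes back false, or `mask_values` contains anything other than 0 and 1, the mask targets are worth a closer look before touching the optimizer.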

Udinanon · July 13, 2022, 2:52pm

I ran the model in eval mode and compared the results with my targets, and everything seems to be in order: a dictionary with areas, bounding boxes, masks and labels, all as tensors.

Also, reading my results more thoroughly, it seems that the loss_mask value is driving the error, while all other losses are in normal ranges.
The odd thing is that this value does not appear anywhere in the tutorial I am following.

I think something here might be the cause: maybe the masks are the wrong shape or size and this breaks the loss computation, but I can’t find much about this specific loss.
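
For context on this loss: as far as torchvision's implementation goes, loss_mask is a per-pixel binary cross-entropy computed on the mask logits, which assumes the ground-truth mask values lie in [0, 1]. The sketch below reimplements that formula in plain Python for a single pixel (`bce_with_logits` here is an illustrative stand-in, not the library call) to show what happens when the target falls outside that range.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one pixel, given a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# With targets in {0, 1} the loss can never drop below zero.
loss_fg_ok = bce_with_logits(2.0, 1.0)
loss_bg_ok = bce_with_logits(2.0, 0.0)

# With a target of 255 the "-logit * target" term dominates, so the loss is
# strongly negative and keeps decreasing as the logit grows.
loss_fg_bad = bce_with_logits(2.0, 255.0)
loss_fg_worse = bce_with_logits(4.0, 255.0)

print(loss_fg_ok, loss_bg_ok, loss_fg_bad, loss_fg_worse)
```

Because a larger logit always lowers the loss when the target exceeds 1, the optimizer can push it down without bound, which would match a loss running off toward negative infinity.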

Udinanon · July 13, 2022, 4:13pm

No, that’s not accurate: the loss_mask value appears when the MaskRCNN model is used and, naturally, not in the parts of the tutorial where FasterRCNN is used. I hadn’t noticed the difference in model.
The masks generated by the model are apparently the same size as my targets.
The error was mine, in building the masks for the targets.
I misunderstood how the masks should be built: I used 0 for the background and 255 for the only class, when the class value should have been 1.
Now that this error is fixed, training seems to be going perfectly fine.
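
For anyone hitting the same issue, a minimal sketch of the fix, assuming the masks are loaded as NumPy arrays with 0 for background and 255 for the hand pixels (`to_binary_mask` and its `threshold` parameter are hypothetical names used only for illustration):

```python
import numpy as np

def to_binary_mask(mask, threshold=0):
    """Map any foreground value above threshold (e.g. 255) to 1; background stays 0."""
    return (np.asarray(mask) > threshold).astype(np.uint8)

# Toy 2x3 mask stored the way many image formats store it: 0 and 255.
raw = np.array([[0, 255, 255],
                [0, 255, 0]], dtype=np.uint8)

binary = to_binary_mask(raw)
print(binary)
```

The resulting array can then be converted to a tensor in the dataset class in the same way the tutorial does for its own masks.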

Ahmad_Obeid · September 15, 2023, 5:07am

Thank you for bringing this to my attention. I was facing the same problem, and it was solved by setting the mask value to 1 instead of 255.

PremRaj_Kala · February 5, 2025, 5:46pm

Saved my day too! Thanks!