MaskRCNN: Training works however kernel dies unexpectedly

Hello, I am new to the forums so I do apologise if I break any conventions/standards.
Utilising:
PyTorch 2.4.1
Torchvision 0.19.1

I am training a Mask R-CNN model using torchvision's maskrcnn_resnet50_fpn_v2 with the default weights.

I don't get any errors during training, but the kernel randomly dies (batch size = 1). It might take 20 images, or it might take 70, but it ultimately just dies.

The only error I see is:
"[error] Disposing session as kernel process died ExitCode: 3221225477, Reason: "

It always happens at the line where the model is called (I used print statements to work out where it stops):

losses = model(stacksetImages, toDeviceDict)
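For reference, this is roughly the call shape torchvision's detection models expect in training mode (a minimal sketch with placeholder data and simplified names, not my actual pipeline):

import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
model.train()

# list of [C, H, W] float images in [0, 1]
images = [torch.rand(3, 1000, 1000)]

# one target dict per image
targets = [{
    "boxes": torch.tensor([[370.4, 0.0, 386.5, 357.4]]),    # [N, 4], XYXY
    "labels": torch.tensor([1], dtype=torch.int64),          # [N]
    "masks": torch.zeros(1, 1000, 1000, dtype=torch.uint8),  # [N, H, W], one mask per box
}]

loss_dict = model(images, targets)  # dict of losses in train mode
loss = sum(loss_dict.values())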

Things I’ve tried:

  • I’ve checked for NaN values in the input and that the mask/image sizes match (see the sanity-check sketch after this list).
  • I’ve checked that images are in the range [0, 1].
  • I’ve checked the recommended datatypes for masks, boxes and images.
  • Searched the error and looked for similar posts.
  • Reviewed other Mask R-CNN code that is available.
  • Set the batch size to 1.
  • Reduced the size of the images going into the model.
  • Ran the model on Colab (it still crashes).
  • Tried the GPU on Colab (CUDA error: invalid memory access).
  • Tried other Mask R-CNN implementations (those work, but I can’t tell what I’m doing wrong, because I’ve compared the inputs of working notebooks with mine and mine look fine).
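
These are roughly the sanity checks I run on each (image, target) pair before the model call (the helper name is my own, just a sketch of the checks listed above):

import torch

def check_sample(image, target):
    # image should be float32 in [0, 1] with no NaN values
    assert image.dtype == torch.float32
    assert not torch.isnan(image).any(), "NaN in image"
    assert image.min() >= 0.0 and image.max() <= 1.0, "image not in [0, 1]"
    # masks must match the image's spatial size
    masks = target["masks"]
    assert masks.shape[-2:] == image.shape[-2:], "mask/image size mismatch"
    # recommended dtypes: floating-point boxes, int64 labels
    assert target["boxes"].dtype.is_floating_point
    assert target["labels"].dtype == torch.int64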

Here is an excerpt of the data I’m putting into the model:

Image: tensor([[[[0.4078, 0.3588, 0.3240,  ..., 0.8667, 0.8667, 0.8667],
          [0.4211, 0.3813, 0.3501,  ..., 0.8667, 0.8667, 0.8667],
          [0.4325, 0.4059, 0.3792,  ..., 0.8667, 0.8667, 0.8667],
          ...,
          [0.6726, 0.6588, 0.6643,  ..., 0.0370, 0.0468, 0.0563],
          [0.6851, 0.6986, 0.7087,  ..., 0.0400, 0.0588, 0.0764],
          [0.7294, 0.7499, 0.7488,  ..., 0.0431, 0.0696, 0.0941]],

         [[0.3686, 0.3155, 0.2786,  ..., 0.8706, 0.8706, 0.8706],
          [0.3841, 0.3402, 0.3064,  ..., 0.8706, 0.8706, 0.8706],
          [0.3972, 0.3671, 0.3377,  ..., 0.8706, 0.8706, 0.8706],
          ...,
          [0.6989, 0.6846, 0.6898,  ..., 0.0448, 0.0560, 0.0670],
          [0.7125, 0.7261, 0.7361,  ..., 0.0478, 0.0685, 0.0882],
          [0.7569, 0.7773, 0.7762,  ..., 0.0509, 0.0793, 0.1059]],

         [[0.2745, 0.2418, 0.2276,  ..., 0.8863, 0.8863, 0.8863],
          [0.2834, 0.2576, 0.2458,  ..., 0.8863, 0.8863, 0.8863],
          [0.2891, 0.2737, 0.2646,  ..., 0.8863, 0.8863, 0.8863],
          ...,
          [0.7303, 0.7160, 0.7222,  ..., 0.0254, 0.0356, 0.0456],
          [0.7439, 0.7574, 0.7683,  ..., 0.0282, 0.0478, 0.0663],
          [0.7882, 0.8087, 0.8084,  ..., 0.0313, 0.0597, 0.0863]]]])
DictInput: [{'masks': Mask([[[False, False, False,  ..., False, False, False],
       [False, False, False,  ..., False, False, False],
       [False, False, False,  ..., False, False, False],
       ...,
       [False, False, False,  ..., False, False, False],
       [False, False, False,  ..., False, False, False],
       [False, False, False,  ..., False, False, False]]]), 'boxes': BoundingBoxes([[370.4248,   0.0000, 386.4934, 357.4296],
               [152.2618,   0.0000, 198.6137, 380.3736],
               [  0.0000, 677.8557, 134.9570, 843.2112],
               [309.5287, 625.7661, 314.6411, 660.0600],
               [312.1872, 563.9846, 315.8681, 583.6185],
               [388.4634, 217.9034, 391.7353, 224.9716],
               [387.6454, 250.3649, 390.5083, 256.3860],
               [389.0769, 284.3971, 391.9398, 290.1563]], format=BoundingBoxFormat.XYXY, canvas_size=(1000, 1000)), 'labels': tensor([1, 1, 1, 1, 1, 1, 1, 1])}]

I’m happy to show code excerpts.

Thank you very much.

Found the solution.

I was passing one mask for the whole image instead of one mask per label/box. Fixing this resolved the CUDA errors and the unstable crashing with no error message.

e.g.

image shape: torch.Size([3, 1000, 1000])
mask shape: torch.Size([10, 1000, 1000])
boxes shape: torch.Size([10, 4])

originally my mask was only [1, 1000, 1000]
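
In case anyone hits the same thing, this is roughly how I now build per-instance masks. It assumes the annotation is a single [H, W] mask where each object has its own integer id, 0 being background (the helper name is mine):

import torch

def split_instance_mask(label_mask):
    """Turn a single [H, W] mask with one integer id per object
    into a [N, H, W] boolean mask, one channel per instance."""
    ids = torch.unique(label_mask)
    ids = ids[ids != 0]  # drop the background id
    return torch.stack([label_mask == i for i in ids])

# The masks then line up one-to-one with the boxes and labels:
# masks:  [N, H, W]
# boxes:  [N, 4]
# labels: [N]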