Mask RCNN Loss is NaN

I am following this tutorial and have only changed the number of classes; mine is 13. I have also added another transformation to resize the images, because they were too large. I am training on a single GPU with a batch size of 1 and a learning rate of 0.005, but lowering it still results in "Loss is NaN". I haven’t tried gradient clipping or normalisation because I am not really certain how to do it with the pre-implemented architecture. Additionally, my dataset consists of single objects within each image. Could it be that the resulting tensor is sparse, and that this causes the loss to behave this way?
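For what it’s worth, this is roughly how I understand gradient clipping would be added to a plain training step; the max_norm value is just a guess, and I’m not sure how to hook it into the tutorial’s training helpers:

    # Rough idea of gradient clipping in a plain training step (max_norm is a guess);
    # model, images, targets and optimizer are the usual objects from the tutorial's loop.
    import torch

    loss_dict = model(images, targets)                 # detection models return a dict of losses in train mode
    losses = sum(loss for loss in loss_dict.values())

    optimizer.zero_grad()
    losses.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip gradients before the step
    optimizer.step()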

Is the loss growing until it eventually yields a NaN value, or do you encounter the NaN in just a single step?

In just a single step. What could possibly be wrong?

Hey, I also experienced the same thing. Did you solve this already?

Hi, I am having the same issue using 15 classes.
Has anyone found a solution?

Thanks

Same issue here; I just get this after one step:

Loss is nan, stopping training
{'loss_classifier': tensor(0.9567, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), 'loss_mask': tensor(2.6179, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_objectness': tensor(13.4737, device='cuda:0',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(18.5134, device='cuda:0', grad_fn=<DivBackward0>)}

What does this mean?

Reducing the learning rate helps it get further, but it still stops after 3 steps.

I was able to fix it for my use case; posting this for anyone out there who might have the same problem. The issue was that I had quite a lot of very small pixel regions/boxes, which messed things up. The way my script works, it creates a box for every isolated pixel region, even if it is only 3 pixels in size. So if your masks contain a lot of very small pieces, you probably have the same problem as I did. I removed all isolated pieces of mask whose area is smaller than 2500 pixels. Now the training works well, without having to reduce the learning rate.

Hey, would you mind sharing your implementation? I also have NaN values.

I’m sorry, I don’t have the code at hand anymore. Have you already tried reducing the learning rate, and are you sure that your dataset doesn’t contain labels that are super small?
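The idea was roughly the sketch below, though it is reconstructed from memory rather than my actual code; the helper name is made up, and 2500 was the pixel-area threshold I used:

    # Rough sketch (not my original code): zero out isolated mask regions
    # whose pixel area is below a threshold, before boxes are built from the mask.
    from scipy import ndimage

    def drop_small_regions(mask, min_area=2500):
        """Remove connected regions of a binary mask smaller than min_area pixels."""
        labeled, num_regions = ndimage.label(mask)
        for region_id in range(1, num_regions + 1):
            region = labeled == region_id
            if region.sum() < min_area:
                mask[region] = 0
        return mask

With those tiny regions gone, the script no longer produces near-degenerate boxes, which is what seemed to blow up the loss in my case.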

I am getting the same error… on a different tensor. I’m having trouble migrating to torchvision FasterRCNN without any mask data added yet.

It seems the masks are NOT optional, and perhaps I need to re-add them to my dataset and loader?

I found my mistake… check that you’re properly calculating the box min/max. For COCO I was under the assumption it was just part of the dataset’s data, when in fact you need something like this:

        # get bounding box coordinates for each mask
        num_targets = len(targets)
        boxes = []
        for i in range(num_targets):
            box = targets[i]["bbox"]
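            # COCO bboxes are [x, y, width, height]; convert to [xmin, ymin, xmax, ymax]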
            xmin = box[0]
            xmax = box[0] + box[2]
            ymin = box[1]
            ymax = box[1] + box[3]
            boxes.append([xmin, ymin, xmax, ymax])

Thank you so much!
I’ve been thinking about a solution for quite a while now, and yes, the given targets were in the [x_min, y_min, w, h] format. 🙂

As @satrya-sabeni mentioned, reducing the learning rate worked for me as well…

I think normalizing the images plays a major role (I say this because I had correct annotations and a moderately low learning rate), especially when they are grayscale images or images dominated by one type of color.

From my experience, the loss_objectness was shooting up to ‘nan’ during the warmup phase and the initial loss was around 2400.
Once I normalized the tensors, the warmup epoch started with a loss of 22 instead of 2400.

After normalizing the images, I can start the training with a learning rate of 0.001 without the nan problems.
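For reference, the normalization step could look roughly like this sketch; the mean/std shown are the common ImageNet defaults, which may or may not fit your data:

    # Rough sketch of per-channel image normalization with torchvision transforms;
    # the ImageNet mean/std values are defaults, adjust them for your own dataset.
    import torchvision.transforms as T

    transform = T.Compose([
        T.ToTensor(),                               # converts a PIL image to a CHW float tensor in [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],     # per-channel mean
                    std=[0.229, 0.224, 0.225]),     # per-channel std
    ])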

Thank you for your advice. For my general understanding, could you explain why we should normalize twice, when there is a GeneralizedRCNNTransform in KeypointRCNN itself that already does it:

KeypointRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(640, 672, 704, 736, 768, 800), max_size=1333, mode='bilinear')
  )
......