Mask RCNN Loss is NaN

I am following this tutorial and have only changed the number of classes; mine is 13. I have also added another transformation to resize the images, because they were too large. I am training on a single GPU with a batch size of 1 and a learning rate of 0.005, but lowering it still results in "Loss is NaN". I haven’t tried gradient clipping or normalisation because I am not really certain how to do it with the pre-implemented architecture. Additionally, my dataset consists of single objects within each image. Could it be that the resulting tensor is sparse, and that this causes the loss to behave this way?
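For what it’s worth, this is roughly how I understand gradient clipping would be added to a plain training step; the max_norm value is just a guess, and I’m not sure how to hook it into the tutorial’s training helpers:

    # Rough idea of gradient clipping in a plain training step (max_norm is a guess);
    # model, images, targets and optimizer are the usual objects from the tutorial's loop.
    import torch

    loss_dict = model(images, targets)                 # detection models return a dict of losses in train mode
    losses = sum(loss for loss in loss_dict.values())

    optimizer.zero_grad()
    losses.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip gradients before the step
    optimizer.step()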

Is the loss growing until it eventually yields a NaN value, or do you encounter the NaN in just a single step?

In just a single step. What could possibly be wrong?

Hey, I also experienced the same thing. Did you solve this already?

Hi, I am having the same issue using 15 classes.
Has anyone found a solution?

Thanks

Same issue here; I just get this after one step:

Loss is nan, stopping training
{'loss_classifier': tensor(0.9567, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), 'loss_mask': tensor(2.6179, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_objectness': tensor(13.4737, device='cuda:0',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(18.5134, device='cuda:0', grad_fn=<DivBackward0>)}

What does this mean?

Reducing the learning rate helps it get further, but it still stops after 3 steps.

I was able to fix it for my use case; posting this for anyone out there who might have the same problem. The issue was that I had quite a lot of very small pixel regions/boxes, which messed things up. The way my script works, it creates a box for every isolated pixel region, even if it is only 3 pixels in size. So if your masks contain a lot of very small pieces, you probably have the same problem as I did. I removed all isolated pieces of mask whose area is smaller than 2500 pixels. Now the training works well, without having to reduce the learning rate.

Hey, would you mind sharing your implementation? I also have NaN values.

I’m sorry, I don’t have the code at hand anymore. Have you already tried reducing the learning rate, and are you sure that your dataset doesn’t contain labels that are super small?
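The idea was roughly the sketch below, though it is reconstructed from memory rather than my actual code; the helper name is made up, and 2500 was the pixel-area threshold I used:

    # Rough sketch (not my original code): zero out isolated mask regions
    # whose pixel area is below a threshold, before boxes are built from the mask.
    from scipy import ndimage

    def drop_small_regions(mask, min_area=2500):
        """Remove connected regions of a binary mask smaller than min_area pixels."""
        labeled, num_regions = ndimage.label(mask)
        for region_id in range(1, num_regions + 1):
            region = labeled == region_id
            if region.sum() < min_area:
                mask[region] = 0
        return mask

With those tiny regions gone, the script no longer produces near-degenerate boxes, which is what seemed to blow up the loss in my case.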

I am getting the same error… on a different tensor. I’m having trouble migrating to torchvision FasterRCNN without any mask data added yet.

It seems the masks are NOT optional, and perhaps I need to re-add them to my dataset and loader?

I found my mistake… check that you’re properly calculating the box min/max. For COCO I was under the assumption it was just part of the dataset’s data, when in fact you need something like this:

        # get bounding box coordinates for each mask
        num_targets = len(targets)
        boxes = []
        for i in range(num_targets):
            box = targets[i]["bbox"]
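            # COCO bboxes are [x, y, width, height]; convert to [xmin, ymin, xmax, ymax]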
            xmin = box[0]
            xmax = box[0] + box[2]
            ymin = box[1]
            ymax = box[1] + box[3]
            boxes.append([xmin, ymin, xmax, ymax])

Thank you so much!
I’ve been thinking about a solution for quite a while now, and yes, the given targets were in the [x_min, y_min, w, h] format. 🙂

As @satrya-sabeni mentioned, reducing the learning rate worked for me as well…

I think normalizing the images plays a major role (I say this because I had correct annotations and a moderately low learning rate), especially when they are grayscale images or images dominated by one type of color.

From my experience, the loss_objectness was shooting up to ‘nan’ during the warmup phase and the initial loss was around 2400.
Once I normalized the tensors, the warmup epoch started with a loss of 22 instead of 2400.

After normalizing the images, I can start the training with a learning rate of 0.001 without the nan problems.
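For reference, the normalization step could look roughly like this sketch; the mean/std shown are the common ImageNet defaults, which may or may not fit your data:

    # Rough sketch of per-channel image normalization with torchvision transforms;
    # the ImageNet mean/std values are defaults, adjust them for your own dataset.
    import torchvision.transforms as T

    transform = T.Compose([
        T.ToTensor(),                               # converts a PIL image to a CHW float tensor in [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],     # per-channel mean
                    std=[0.229, 0.224, 0.225]),     # per-channel std
    ])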

Thank you for your advice. For my general understanding, could you explain why we should normalize twice, when there is a GeneralizedRCNNTransform in KeypointRCNN itself that already does it:

KeypointRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(640, 672, 704, 736, 768, 800), max_size=1333, mode='bilinear')
  )
......