Rpn_box_reg loss is nan

Hi, I am trying to run a Faster R-CNN model based on the torchvision example for a custom dataset.

However, I have noticed that during training, if xmax is smaller than xmin, the rpn_box_reg loss goes to NaN. In my annotations, xmax and ymax represent the top-left corner and xmin and ymin the bottom-right corner. This is a snippet of the output I get, with the bounding boxes printed:

tensor([[ 44., 108.,  49., 224.],
        [ 29.,  73., 210., 230.],
        [ 31.,  58., 139., 228.],
        [ 22.,  43., 339., 222.]], device='cuda:0')
Epoch: [0]  [   0/1173]  eta: 0:09:46  lr: 0.000000  loss: 9.3683 (9.3683)  loss_classifier: 1.7522 (1.7522)  loss_box_reg: 0.0755 (0.0755)  loss_objectness: 6.1522 (6.1522)  loss_rpn_box_reg: 1.3884 (1.3884)  time: 0.4997  data: 0.1162  max mem: 5696
tensor([[  0.,   0., 640., 512.]], device='cuda:0')
tensor([[ 28.,  57., 197., 220.]], device='cuda:0')
tensor([[ 23.,  46., 281., 222.]], device='cuda:0')
tensor([[ 20.,  28., 328., 210.]], device='cuda:0')
tensor([[ 37.,  45.,  47., 161.],
        [ 31.,  39., 111., 154.]], device='cuda:0')
tensor([[  0.,   0., 640., 512.]], device='cuda:0')
tensor([[ 33.,  85., 546., 222.],
        [ 31.,  85., 527., 213.]], device='cuda:0')
tensor([[ 40.,  76.,  29., 211.],
        [ 64.,  51.,  26., 206.],
        [ 40.,  77.,   1., 221.]], device='cuda:0')
Loss is nan, stopping training
{'loss_classifier': tensor(1.78, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(16.28, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)}
An exception has occurred, use %tb to see the full traceback.

As you can see, each box is given as [xmin, ymin, xmax, ymax].

Thank you in advance.

Hello, sometimes if your learning rate is too high, the proposals will go outside the image and the rpn_box_reg loss will become too large, eventually resulting in NaN. Try printing the rpn_box_reg loss and see if this is the case; if so, try lowering the learning rate. Remember to scale your learning rate linearly according to your batch size, as in the sketch below. Hope this helps :slight_smile:
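For reference, a minimal sketch of the linear scaling rule, assuming a baseline of lr = 0.02 at an effective batch size of 16 (the values used in the torchvision detection reference scripts); treat both numbers as assumptions and substitute your own baseline:

reference_lr = 0.02          # assumed baseline LR (torchvision reference scripts)
reference_batch_size = 16    # assumed baseline effective batch size (8 GPUs x 2 images)

my_batch_size = 2            # whatever your DataLoader actually uses
scaled_lr = reference_lr * my_batch_size / reference_batch_size
print(scaled_lr)             # 0.0025 for a batch size of 2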


Thank you for your quick reply. I have tried reducing the learning rate all the way down to 0.00001, but I continue to get the same issue. These are my settings:

# only optimise the parameters that require gradients
params = [p for p in model_ft.parameters() if p.requires_grad]

optimizer = torch.optim.SGD(params, lr=0.00001,
                            momentum=0.9, weight_decay=0.0005)

# decay the learning rate by a factor of 10 every epoch
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=1,
                                               gamma=0.1)

I have noticed that this only seems to emerge when the xmax value is lower than the xmin value. Would you have any idea why this may be the case?

I will have a look at the rpn_box_reg loss as you suggested to see if the values are high; a rough sketch of how I plan to do it is below.
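This is not the exact train_one_epoch from the torchvision reference, just a sketch assuming model, optimizer, data_loader and device are already set up as in the detection example:

import math

model.train()
for images, targets in data_loader:
    images = [img.to(device) for img in images]
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    loss_dict = model(images, targets)              # dict with the four losses
    print({k: v.item() for k, v in loss_dict.items()})

    losses = sum(loss for loss in loss_dict.values())
    if not math.isfinite(losses.item()):            # stop before NaN propagates
        print("Loss is not finite:", loss_dict)
        break

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()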

You can use the anomaly detection tool to check where the NaNs are produced; a minimal sketch of how to enable it is below. In my experience, NaNs appear when you have very large values in your gradients or when you are doing a mathematically undefined operation (e.g. log(0)). In this case the latter is not likely, so it must be the former. Try to double-check that your dataloader is working correctly and that it is providing you with the correct annotations.
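import torch

# Minimal sketch: turn on autograd anomaly detection globally so the backward
# pass raises an error (with a traceback pointing at the offending forward op)
# as soon as a NaN gradient shows up. It slows training down noticeably, so
# only keep it on while debugging.
torch.autograd.set_detect_anomaly(True)

# ...then run your usual training loop. Alternatively, wrap a single step:
# with torch.autograd.detect_anomaly():
#     loss_dict = model(images, targets)
#     sum(loss_dict.values()).backward()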

Hi, a simple alternative would be to preprocess your ground truth before training: you can discard the invalid samples. A few things to check (see the sketch below):

1. xmax > xmin and ymax > ymin for every box.
2. xmin, ymin, xmax and ymax always lie inside your image.
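Just a sketch of what I mean, assuming your boxes come as an (N, 4) tensor in [xmin, ymin, xmax, ymax] order together with a matching labels tensor (the names here are made up):

import torch

def clean_boxes(boxes, labels, img_w, img_h):
    # clip every coordinate into the image, then drop degenerate boxes
    boxes = boxes.clone()
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, img_w)   # x coordinates
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, img_h)   # y coordinates
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return boxes[keep], labels[keep]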


Thank you for your advice. I have checked the dataloader, and it seems to be fine. I even tested the annotations that were giving me issues by plotting the images and drawing the bounding boxes on them, and it all looks to be working as it should. I haven't managed to get the anomaly detection tool working yet; I will continue to work on it tomorrow. Is the snippet above the right way to use it?

Hi, all the samples that I am using are valid. The issue I am facing is that when the x1 (xmax) value is smaller than the x2 (xmin) value, the rpn_box_reg loss becomes NaN. For example, for the image below, the bounding box is tensor([[ 53., 89., 7., 226.]]), i.e. [x2, y2, x1, y1]. When x1 is smaller than x2, the rpn_box_reg loss goes to NaN (and loss_box_reg drops to zero, as in the loss dict above); however, it works fine when x1 > x2. In fact, it trains quite well then. As you can see, the values themselves are correct, as the cyclist has the correct bounding box based on the values above. I hope this makes the issue I am facing a bit clearer.

Well, I think I get your point, but why don't you just remove these annotations from the beginning?

Thank you for your reply. I would prefer to keep these annotations if possible, as annotations for cyclists are limited. Also, would it not mean that during evaluation the model's outputs would always be coordinates where x1 > x2, and that it would therefore never pick up objects like the cyclist in the lower left-hand corner of the image?

I'm hoping this is making sense. I am new to machine learning and PyTorch, so I may not fully understand some of the concepts.
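One thing I am considering instead of discarding them is re-ordering the coordinates so that every box becomes a valid [xmin, ymin, xmax, ymax] box. A rough sketch of what I mean (this assumes the corners are merely swapped in the annotation file, which I still need to verify):

import torch

def fix_box_order(boxes):
    # boxes: (N, 4) tensor; re-order so the smaller x/y always comes first
    x = boxes[:, [0, 2]]
    y = boxes[:, [1, 3]]
    return torch.stack([x.min(dim=1).values, y.min(dim=1).values,
                        x.max(dim=1).values, y.max(dim=1).values], dim=1)

print(fix_box_order(torch.tensor([[53., 89., 7., 226.]])))
# tensor([[  7.,  89.,  53., 226.]])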

I discussed this issue here as well. I tried two different datasets, and in both cases rpn_box_reg becomes NaN in the first epoch itself.

I will let you know if I get this working.

Is this torchvision-specific, or does my code have some issue? I've filed an issue. I checked my data thoroughly, and everything seems to be working fine except the value of this specific loss.

I'm not sure, but I think it may have something to do with how the RPN box loss is calculated. Again, I'm not sure and this is only a guess: perhaps because x1 is smaller than x2, a negative box size goes into the loss calculation, and that is what is causing the NaN value. I am still looking into how the RPN box loss is calculated based on the rpn.py file.
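For what it's worth, that guess lines up with how the regression targets are encoded: the width and height targets involve a log of the ratio between the ground-truth box size and the anchor size, so a box whose xmax is smaller than xmin has a negative width and the log produces NaN. A tiny illustration (not the actual torchvision code; the anchor width below is made up):

import torch

gt_width = torch.tensor(7.0 - 53.0)        # -46: the "xmax" is smaller than the "xmin", as in the box above
anchor_width = torch.tensor(32.0)          # arbitrary, made-up anchor width
print(torch.log(gt_width / anchor_width))  # tensor(nan) -> this then poisons loss_rpn_box_reg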

Let us know if you find something.

I have decreased the learning rate, and it is already quite small. Do you think I should decrease it further?

No, decreasing the LR won't help. In my case, fmassa says that I have an issue with my bbox notation. I'll work on the fix.

This should be the solution to our issue: https://github.com/pytorch/vision/issues/1128

However, we will have to skip the images without annotations, so I have followed the solution given here: Ignore images without annotations
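In case it helps someone, what I ended up doing is roughly the following (just a sketch, not the exact code from that thread; it assumes the dataset returns (image, target) pairs with a "boxes" tensor in the target):

import torch

# keep only the indices whose target actually contains at least one box;
# note that this iterates over the whole dataset once up front
keep = [i for i in range(len(dataset))
        if dataset[i][1]["boxes"].shape[0] > 0]
dataset = torch.utils.data.Subset(dataset, keep)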

Good luck!


This is gold: “scale your learning rate linearly according to your batch size”. Thank you!