Hi, I am trying to train a Faster R-CNN model, based on the torchvision example, on a custom dataset.
However, I have noticed that during training, if xmax is smaller than xmin for any box, the loss_rpn_box_reg loss goes to nan. Here (xmin, ymin) is meant to be the top-left corner and (xmax, ymax) the bottom-right corner. This is a snippet of the error that I get, with the bounding boxes printed:
tensor([[ 44., 108., 49., 224.],
[ 29., 73., 210., 230.],
[ 31., 58., 139., 228.],
[ 22., 43., 339., 222.]], device='cuda:0')
Epoch: [0] [ 0/1173] eta: 0:09:46 lr: 0.000000 loss: 9.3683 (9.3683) loss_classifier: 1.7522 (1.7522) loss_box_reg: 0.0755 (0.0755) loss_objectness: 6.1522 (6.1522) loss_rpn_box_reg: 1.3884 (1.3884) time: 0.4997 data: 0.1162 max mem: 5696
tensor([[ 0., 0., 640., 512.]], device='cuda:0')
tensor([[ 28., 57., 197., 220.]], device='cuda:0')
tensor([[ 23., 46., 281., 222.]], device='cuda:0')
tensor([[ 20., 28., 328., 210.]], device='cuda:0')
tensor([[ 37., 45., 47., 161.],
[ 31., 39., 111., 154.]], device='cuda:0')
tensor([[ 0., 0., 640., 512.]], device='cuda:0')
tensor([[ 33., 85., 546., 222.],
[ 31., 85., 527., 213.]], device='cuda:0')
tensor([[ 40., 76., 29., 211.],
[ 64., 51., 26., 206.],
[ 40., 77., 1., 221.]], device='cuda:0')
Loss is nan, stopping training
{'loss_classifier': tensor(1.78, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(16.28, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)}
An exception has occurred, use %tb to see the full traceback.
As you can see, each box is given as [xmin, ymin, xmax, ymax].
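For reference, here is a minimal sketch of a sanity check for such boxes before they reach the model. It works on plain Python lists (for a tensor, something like `target["boxes"].tolist()` would give that form). The helper names `find_degenerate_boxes` and `reorder_corners` are hypothetical, not part of torchvision. Swapping the corners is only an assumption about what the right fix is: it makes sense only if the annotations merely have their corners in the wrong order.

```python
def find_degenerate_boxes(boxes):
    """Return indices of boxes where xmax <= xmin or ymax <= ymin.

    `boxes` is a list of [xmin, ymin, xmax, ymax] rows.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmax <= xmin or ymax <= ymin:
            bad.append(i)
    return bad


def reorder_corners(boxes):
    """Reorder coordinates so that xmin <= xmax and ymin <= ymax.

    Assumption: the box is valid and only its corners were swapped.
    """
    return [
        [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]
        for x1, y1, x2, y2 in boxes
    ]


# The last batch printed before the nan in the log above.
boxes = [
    [40.0, 76.0, 29.0, 211.0],
    [64.0, 51.0, 26.0, 206.0],
    [40.0, 77.0, 1.0, 221.0],
]
print(find_degenerate_boxes(boxes))  # all three boxes have xmax < xmin
print(reorder_corners(boxes))
```

Running the first helper over every target in the dataset should show whether the nan always coincides with a box whose xmax is not greater than its xmin.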
Thank you in advance.