loss_objectness is NaN

GGlearning · March 24, 2023, 1:40pm

Hi folks,

While training a Faster R-CNN (the code is available here), I am running into the following problem. I suspect something is wrong with the input or the targets, but I cannot work out what exactly. Any suggestions would be appreciated.

Loss is nan, stopping training
{'loss_classifier': tensor(0.0168, grad_fn=<NllLossBackward0>), 'loss_box_reg': tensor(0.0019, grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'loss_rpn_box_reg': tensor(0.3363, grad_fn=<DivBackward0>)}
An exception has occurred, use %tb to see the full traceback.
SystemExit: 1
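
To make it easier to see which term breaks first, a small helper can flag the non-finite entries of the loss dictionary before training stops. This is only a minimal sketch: `non_finite_losses` is a hypothetical helper name, and the values below are copied from the log above as plain floats (0-dim tensors should work too, since `float()` accepts them).

```python
import math

def non_finite_losses(loss_dict):
    """Return the names of loss terms whose value is NaN or +/-inf."""
    # float() accepts plain numbers and 0-dim tensors alike.
    return [name for name, value in loss_dict.items()
            if not math.isfinite(float(value))]

# Values copied from the log above, as plain floats.
losses = {
    "loss_classifier": 0.0168,
    "loss_box_reg": 0.0019,
    "loss_objectness": float("nan"),
    "loss_rpn_box_reg": 0.3363,
}
print(non_finite_losses(losses))  # → ['loss_objectness']
```

Calling it on every iteration, together with the current image_id, would show whether the NaN is tied to particular samples or appears regardless of the batch.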

The training images are normalised, with values between 0 and 1. The targets are dictionaries. Below are the details of a sample image patch and its targets just before they are fed to the model. The patch’s histogram is attached as well.

targets keys dict_keys(['boxes', 'labels', 'length_labels', 'scene_id', 'chip_id', 'image_id', 'area', 'iscrowd'])
targets boxes tensor([[226., 681., 236., 691.],
[495., 19., 505., 29.],
[495., 12., 505., 22.],
[704., 704., 714., 714.],
[703., 407., 713., 417.],
[345., 749., 355., 759.],
[336., 700., 346., 710.],
[766., 712., 776., 722.]])
targets labels tensor([1, 3, 3, 1, 1, 1, 2, 1])
targets scene_id 590dd08f71056cacv
targets chip_id tensor(858)
targets image_id tensor(32)
targets area tensor([100., 100., 100., 100., 100., 100., 100., 100.])
targets iscrowd tensor([0, 0, 0, 0, 0, 0, 0, 0])
torch.Size([3, 800, 800])
torch.float32
tensor(0.) tensor(1.)
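
Since degenerate or out-of-bounds boxes are a common suspect when a detection loss turns NaN, a quick sanity check of the target boxes may help rule them out. The sketch below assumes boxes in [x1, y1, x2, y2] pixel format and uses a hypothetical helper, `find_bad_boxes`, that works on plain lists (for a tensor, `target["boxes"].tolist()` could be passed instead). The boxes and the 800 x 800 image size are taken from the dump above.

```python
import math

def find_bad_boxes(boxes, width, height):
    """Return (index, reason) pairs for boxes that look invalid."""
    problems = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if any(math.isnan(v) for v in (x1, y1, x2, y2)):
            problems.append((i, "NaN coordinate"))
        elif x2 <= x1 or y2 <= y1:
            problems.append((i, "zero or negative width/height"))
        elif x1 < 0 or y1 < 0 or x2 > width or y2 > height:
            problems.append((i, "outside image"))
    return problems

# Boxes copied from the sample target above.
boxes = [
    [226., 681., 236., 691.],
    [495., 19., 505., 29.],
    [495., 12., 505., 22.],
    [704., 704., 714., 714.],
    [703., 407., 713., 417.],
    [345., 749., 355., 759.],
    [336., 700., 346., 710.],
    [766., 712., 776., 722.],
]
print(find_bad_boxes(boxes, width=800, height=800))  # → []
```

This particular sample passes all three checks, so if the boxes are at fault, the offending ones would have to be in a different patch. Running the same check over the whole dataset would be the next step.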

[Attached image: img1_target]

colinlaganier · October 16, 2023, 10:31am

I am having the same issue running it on my M1 MacBook. Have you managed to solve this problem, and were you running it on an M1/M2 too?