Large training loss with a ResNeXt50_32x4d backbone in KeypointRCNN?

Hello,

I am interested in the Keypoint RCNN function in PyTorch from here. As I understand it, it uses a Mask RCNN-style architecture to get keypoints of the human body. The link above shows how to combine a “resnet50” backbone with an FPN for Keypoint RCNN. I tried to swap in other backbones via the `resnet_fpn_backbone()` function: “resnet18”, “wide_resnet50_2”, “wide_resnet101_2”, and “resnext50_32x4d”, each initialized with pretrained weights. My training plan is below:

  • learning rate: 0.02
  • learning rate schedule: reduce by a factor of 10 every 60 epochs
  • backbone trainable layers: 1
  • backbone initialized with pretrained weights
  • train for 200 epochs
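
In code, the plan corresponds roughly to the following (a minimal sketch; a dummy model stands in for the real KeypointRCNN, and the momentum value is an assumption taken from the reference scripts, not from my exact setup):

```python
import torch

# Sketch of the training plan above; a dummy module stands in for KeypointRCNN.
model = torch.nn.Linear(4, 2)  # placeholder for the real model

# Optimize only the parameters that are not frozen.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.02, momentum=0.9)  # momentum assumed

# Reduce the learning rate by a factor of 10 every 60 epochs.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # train_one_epoch(model, optimizer, data_loader, device, epoch)  # omitted
    lr_scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.02 * 0.1**3 after three reductions
```

With this schedule the learning rate at epoch 198 is 0.02 × 10⁻³ = 0.000020, which matches the `lr: 0.000020` shown in the log below.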

According to this plan, I train the keypoint head, the keypoint predictor, and layer 4 of the backbone. “resnet18”, “wide_resnet50_2”, and “wide_resnet101_2” all work: the loss goes down and the inference results look good. However, I have a problem with “resnext50_32x4d”: the training loss stays very large. Part of the training log is shown below:

Epoch: [198]  [  0/415]  eta: 0:57:42  lr: 0.000020  loss: 2.1488 (2.1488)  loss_classifier: 0.0365 (0.0365)  loss_box_reg: 0.0660 (0.0660)  loss_keypoint: 2.0387 (2.0387)  loss_objectness: 0.0062 (0.0062)  loss_rpn_box_reg: 0.0014 (0.0014)  backbone_lr: 0.0000 (0.0000)  time: 8.3441  data: 7.0423  max mem: 0
Epoch: [198]  [400/415]  eta: 0:00:17  lr: 0.000020  loss: 2.0049 (62287665.0680)  loss_classifier: 0.0461 (0.0554)  loss_box_reg: 0.0668 (0.0695)  loss_keypoint: 1.8585 (62287477.6360)  loss_objectness: 0.0057 (62.1232)  loss_rpn_box_reg: 0.0041 (125.3357)  backbone_lr: 0.0000 (0.0000)  time: 1.1544  data: 0.0172  max mem: 0
Epoch: [198]  [414/415]  eta: 0:00:01  lr: 0.000020  loss: 2.3009 (60186394.5416)  loss_classifier: 0.0575 (0.0555)  loss_box_reg: 0.0803 (0.0699)  loss_keypoint: 2.1628 (60186213.4271)  loss_objectness: 0.0120 (60.0281)  loss_rpn_box_reg: 0.0050 (121.1077)  backbone_lr: 0.0000 (0.0000)  time: 1.1668  data: 0.0181  max mem: 0
Epoch: [198] Total time: 0:08:04 (1.1677 s / it)
Validation:  [  0/100]  eta: 0:07:19  loss: 2.4968 (2.4968)  loss_classifier: 0.0300 (0.0300)  loss_box_reg: 0.0487 (0.0487)  loss_keypoint: 2.4162 (2.4162)  loss_objectness: 0.0009 (0.0009)  loss_rpn_box_reg: 0.0009 (0.0009)  pixDist: 1.6672 (1.6672)  model_time: 0.3858 (0.3858)  time: 4.3959  data: 3.9930  max mem: 0
Validation:  [ 99/100]  eta: 0:00:00  loss: 1.9306 (2.4355)  loss_classifier: 0.0234 (0.0530)  loss_box_reg: 0.0307 (0.0636)  loss_keypoint: 1.8603 (2.2977)  loss_objectness: 0.0025 (0.0165)  loss_rpn_box_reg: 0.0029 (0.0048)  pixDist: 5.9226 (18.6548)  model_time: 0.2072 (0.2299)  time: 0.2167  data: 0.0010  max mem: 0
Validation: Total time: 0:00:28 (0.2815 s / it)
Averaged stats: loss: 1.9306 (2.4355)  loss_classifier: 0.0234 (0.0530)  loss_box_reg: 0.0307 (0.0636)  loss_keypoint: 1.8603 (2.2977)  loss_objectness: 0.0025 (0.0165)  loss_rpn_box_reg: 0.0029 (0.0048)  pixDist: 5.9226 (18.6548)  model_time: 0.2072 (0.2299)
Epoch: [199]  [  0/415]  eta: 0:52:42  lr: 0.000020  loss: 1.6407 (1.6407)  loss_classifier: 0.0340 (0.0340)  loss_box_reg: 0.0583 (0.0583)  loss_keypoint: 1.5362 (1.5362)  loss_objectness: 0.0035 (0.0035)  loss_rpn_box_reg: 0.0086 (0.0086)  backbone_lr: 0.0000 (0.0000)  time: 7.6195  data: 6.4353  max mem: 0
Epoch: [199]  [400/415]  eta: 0:00:17  lr: 0.000020  loss: 2.2163 (49125692.3149)  loss_classifier: 0.0541 (933653.1231)  loss_box_reg: 0.0530 (0.0693)  loss_keypoint: 2.0583 (48192017.1261)  loss_objectness: 0.0131 (11.7248)  loss_rpn_box_reg: 0.0035 (10.2920)  backbone_lr: 0.0000 (0.0000)  time: 1.1280  data: 0.0092  max mem: 0
Epoch: [199]  [414/415]  eta: 0:00:01  lr: 0.000020  loss: 2.1601 (47468440.1186)  loss_classifier: 0.0441 (902156.3929)  loss_box_reg: 0.0660 (0.0693)  loss_keypoint: 2.0583 (46566262.4014)  loss_objectness: 0.0084 (11.3297)  loss_rpn_box_reg: 0.0036 (9.9450)  backbone_lr: 0.0000 (0.0000)  time: 1.1506  data: 0.0082  max mem: 0
Epoch: [199] Total time: 0:08:02 (1.1627 s / it)
Validation:  [  0/100]  eta: 0:07:16  loss: 2.3626 (2.3626)  loss_classifier: 0.0189 (0.0189)  loss_box_reg: 0.0500 (0.0500)  loss_keypoint: 2.2917 (2.2917)  loss_objectness: 0.0015 (0.0015)  loss_rpn_box_reg: 0.0005 (0.0005)  pixDist: 1.6832 (1.6832)  model_time: 0.3154 (0.3154)  time: 4.3678  data: 4.0433  max mem: 0
Validation:  [ 99/100]  eta: 0:00:00  loss: 1.5503 (2.5065)  loss_classifier: 0.0220 (0.0534)  loss_box_reg: 0.0319 (0.0631)  loss_keypoint: 1.4727 (2.3703)  loss_objectness: 0.0034 (0.0160)  loss_rpn_box_reg: 0.0029 (0.0037)  pixDist: 5.9424 (17.9660)  model_time: 0.1999 (0.2264)  time: 0.2159  data: 0.0010  max mem: 0
Validation: Total time: 0:00:27 (0.2777 s / it)
Averaged stats: loss: 1.5503 (2.5065)  loss_classifier: 0.0220 (0.0534)  loss_box_reg: 0.0319 (0.0631)  loss_keypoint: 1.4727 (2.3703)  loss_objectness: 0.0034 (0.0160)  loss_rpn_box_reg: 0.0029 (0.0037)  pixDist: 5.9424 (17.9660)  model_time: 0.1999 (0.2264) 

It can be seen that the training loss reaches very large values; for example, at “Epoch: [198] [414/415]” the loss is “2.3009 (60186394.5416)”. I don’t know the meaning of the numbers inside and outside the brackets. My guess is that the number outside the brackets, “2.3009”, is the current loss, and the number inside, “(60186394.5416)”, is some average or peak value. After about 200 epochs it still shows these huge values. I guessed that the frozen layers of the backbone might be causing the problem, so I set the “backbone trainable layers” to 2 and used a learning rate of 0.01. However, Keypoint RCNN then trained to about the 50th epoch, reported a NaN loss, and training stopped. Does anyone know the reason?
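
To make my guess about the two numbers concrete: if the logger works like the torchvision detection reference utilities, the value outside the brackets would be a median over a recent window of iterations and the value in brackets a global average over the whole epoch. A stripped-down sketch of such a tracker (my reconstruction, not the actual reference code; the class name and window size are assumptions):

```python
from collections import deque

class SmoothedValue:
    """Track a metric and print it as `median (global_avg)` -- a sketch
    in the style of the torchvision detection reference utilities."""

    def __init__(self, window_size=20):
        self.window = deque(maxlen=window_size)  # recent values only
        self.total = 0.0                         # running sum over the run
        self.count = 0

    def update(self, value):
        self.window.append(value)
        self.total += value
        self.count += 1

    @property
    def median(self):
        vals = sorted(self.window)
        mid = len(vals) // 2
        return vals[mid] if len(vals) % 2 else 0.5 * (vals[mid - 1] + vals[mid])

    @property
    def global_avg(self):
        return self.total / self.count

    def __str__(self):
        return f"{self.median:.4f} ({self.global_avg:.4f})"

loss = SmoothedValue()
for v in (2.0, 1.5, 2.5, 6e7):  # one exploding iteration poisons the average
    loss.update(v)
print(loss)  # the median stays small while the global average blows up
```

If that is right, a single exploding iteration early in an epoch would keep the bracketed average huge for the rest of the epoch even while the current loss looks normal.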