Training a ResNeXt50_32x4d backbone in KeypointRCNN gives very large loss values?

Hello,

I am interested in the Keypoint RCNN model in PyTorch (from here). As I understand it, it uses a Mask R-CNN style architecture to detect keypoints of the human body. The link above shows how to combine a "resnet50" backbone with an FPN for Keypoint RCNN. I tried swapping in other backbones through "resnet_fpn_backbone()": "resnet18", "wide_resnet50_2", "wide_resnet101_2", and "resnext50_32x4d", each time loading the pre-trained weights (a sketch of how I build the model and optimizer follows the plan below). My training plan is:

  • learning rate: 0.02
  • learning rate schedule: reduce to 1/10 after 60 epochs
  • backbone trainable layers: 1
  • backbone initialized from pre-trained weights
  • train for 200 epochs
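
For reference, this is roughly how I construct the model and optimizer. It is a minimal sketch: the "pretrained"/"trainable_layers" arguments match the older torchvision API (newer versions use "weights="), "num_classes=2"/"num_keypoints=17" stand in for my dataset, and the SGD momentum/weight decay are typical defaults rather than my exact values.

```python
import torch
from torchvision.models.detection import KeypointRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Swap the backbone name here; I used the same code for "resnet18",
# "wide_resnet50_2" and "wide_resnet101_2".
backbone = resnet_fpn_backbone(
    "resnext50_32x4d",
    pretrained=True,       # ImageNet pre-trained weights
    trainable_layers=1,    # only layer4 of the backbone is unfrozen
)
model = KeypointRCNN(backbone, num_classes=2, num_keypoints=17)

# Optimize only the parts I actually train: the keypoint head/predictor
# plus the unfrozen backbone layer4 (parameter names are torchvision's).
params = [p for n, p in model.named_parameters()
          if p.requires_grad and ("keypoint" in n or "layer4" in n)]
optimizer = torch.optim.SGD(params, lr=0.02, momentum=0.9, weight_decay=1e-4)

# "reduce to 1/10 after 60 epochs" as a StepLR:
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
```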

Following this plan, I train the keypoint head, the keypoint predictor, and backbone layer4. "resnet18", "wide_resnet50_2", and "wide_resnet101_2" all work: the loss goes down and the inference results look good. However, I have a problem with "resnext50_32x4d": its training loss is always very large. Part of the training log is shown below:

Epoch: [198]  [  0/415]  eta: 0:57:42  lr: 0.000020  loss: 2.1488 (2.1488)  loss_classifier: 0.0365 (0.0365)  loss_box_reg: 0.0660 (0.0660)  loss_keypoint: 2.0387 (2.0387)  loss_objectness: 0.0062 (0.0062)  loss_rpn_box_reg: 0.0014 (0.0014)  backbone_lr: 0.0000 (0.0000)  time: 8.3441  data: 7.0423  max mem: 0
Epoch: [198]  [400/415]  eta: 0:00:17  lr: 0.000020  loss: 2.0049 (62287665.0680)  loss_classifier: 0.0461 (0.0554)  loss_box_reg: 0.0668 (0.0695)  loss_keypoint: 1.8585 (62287477.6360)  loss_objectness: 0.0057 (62.1232)  loss_rpn_box_reg: 0.0041 (125.3357)  backbone_lr: 0.0000 (0.0000)  time: 1.1544  data: 0.0172  max mem: 0
Epoch: [198]  [414/415]  eta: 0:00:01  lr: 0.000020  loss: 2.3009 (60186394.5416)  loss_classifier: 0.0575 (0.0555)  loss_box_reg: 0.0803 (0.0699)  loss_keypoint: 2.1628 (60186213.4271)  loss_objectness: 0.0120 (60.0281)  loss_rpn_box_reg: 0.0050 (121.1077)  backbone_lr: 0.0000 (0.0000)  time: 1.1668  data: 0.0181  max mem: 0
Epoch: [198] Total time: 0:08:04 (1.1677 s / it)
Validation:  [  0/100]  eta: 0:07:19  loss: 2.4968 (2.4968)  loss_classifier: 0.0300 (0.0300)  loss_box_reg: 0.0487 (0.0487)  loss_keypoint: 2.4162 (2.4162)  loss_objectness: 0.0009 (0.0009)  loss_rpn_box_reg: 0.0009 (0.0009)  pixDist: 1.6672 (1.6672)  model_time: 0.3858 (0.3858)  time: 4.3959  data: 3.9930  max mem: 0
Validation:  [ 99/100]  eta: 0:00:00  loss: 1.9306 (2.4355)  loss_classifier: 0.0234 (0.0530)  loss_box_reg: 0.0307 (0.0636)  loss_keypoint: 1.8603 (2.2977)  loss_objectness: 0.0025 (0.0165)  loss_rpn_box_reg: 0.0029 (0.0048)  pixDist: 5.9226 (18.6548)  model_time: 0.2072 (0.2299)  time: 0.2167  data: 0.0010  max mem: 0
Validation: Total time: 0:00:28 (0.2815 s / it)
Averaged stats: loss: 1.9306 (2.4355)  loss_classifier: 0.0234 (0.0530)  loss_box_reg: 0.0307 (0.0636)  loss_keypoint: 1.8603 (2.2977)  loss_objectness: 0.0025 (0.0165)  loss_rpn_box_reg: 0.0029 (0.0048)  pixDist: 5.9226 (18.6548)  model_time: 0.2072 (0.2299)
Epoch: [199]  [  0/415]  eta: 0:52:42  lr: 0.000020  loss: 1.6407 (1.6407)  loss_classifier: 0.0340 (0.0340)  loss_box_reg: 0.0583 (0.0583)  loss_keypoint: 1.5362 (1.5362)  loss_objectness: 0.0035 (0.0035)  loss_rpn_box_reg: 0.0086 (0.0086)  backbone_lr: 0.0000 (0.0000)  time: 7.6195  data: 6.4353  max mem: 0
Epoch: [199]  [400/415]  eta: 0:00:17  lr: 0.000020  loss: 2.2163 (49125692.3149)  loss_classifier: 0.0541 (933653.1231)  loss_box_reg: 0.0530 (0.0693)  loss_keypoint: 2.0583 (48192017.1261)  loss_objectness: 0.0131 (11.7248)  loss_rpn_box_reg: 0.0035 (10.2920)  backbone_lr: 0.0000 (0.0000)  time: 1.1280  data: 0.0092  max mem: 0
Epoch: [199]  [414/415]  eta: 0:00:01  lr: 0.000020  loss: 2.1601 (47468440.1186)  loss_classifier: 0.0441 (902156.3929)  loss_box_reg: 0.0660 (0.0693)  loss_keypoint: 2.0583 (46566262.4014)  loss_objectness: 0.0084 (11.3297)  loss_rpn_box_reg: 0.0036 (9.9450)  backbone_lr: 0.0000 (0.0000)  time: 1.1506  data: 0.0082  max mem: 0
Epoch: [199] Total time: 0:08:02 (1.1627 s / it)
Validation:  [  0/100]  eta: 0:07:16  loss: 2.3626 (2.3626)  loss_classifier: 0.0189 (0.0189)  loss_box_reg: 0.0500 (0.0500)  loss_keypoint: 2.2917 (2.2917)  loss_objectness: 0.0015 (0.0015)  loss_rpn_box_reg: 0.0005 (0.0005)  pixDist: 1.6832 (1.6832)  model_time: 0.3154 (0.3154)  time: 4.3678  data: 4.0433  max mem: 0
Validation:  [ 99/100]  eta: 0:00:00  loss: 1.5503 (2.5065)  loss_classifier: 0.0220 (0.0534)  loss_box_reg: 0.0319 (0.0631)  loss_keypoint: 1.4727 (2.3703)  loss_objectness: 0.0034 (0.0160)  loss_rpn_box_reg: 0.0029 (0.0037)  pixDist: 5.9424 (17.9660)  model_time: 0.1999 (0.2264)  time: 0.2159  data: 0.0010  max mem: 0
Validation: Total time: 0:00:27 (0.2777 s / it)
Averaged stats: loss: 1.5503 (2.5065)  loss_classifier: 0.0220 (0.0534)  loss_box_reg: 0.0319 (0.0631)  loss_keypoint: 1.4727 (2.3703)  loss_objectness: 0.0034 (0.0160)  loss_rpn_box_reg: 0.0029 (0.0037)  pixDist: 5.9424 (17.9660)  model_time: 0.1999 (0.2264) 

As the log shows, the averaged training loss is enormous; at "Epoch: [198] [414/415]", for example, the loss prints as "2.3009 (60186394.5416)". I was not sure what the two numbers mean, but if I read torchvision's reference "utils.py" correctly, the value outside the brackets is the median over a recent window of iterations, and the value inside the brackets is the global average for the epoch. A minimal sketch of that bookkeeping (my reconstruction, not the exact reference code):
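```python
from collections import deque
import statistics

class SmoothedValue:
    """Tracks one metric the way I understand torchvision's reference
    utils.SmoothedValue does: it prints "median (global_avg)"."""
    def __init__(self, window_size=20):
        self.window = deque(maxlen=window_size)  # recent values only
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.window.append(value)
        self.total += value
        self.count += 1

    def __str__(self):
        median = statistics.median(self.window)   # outside the brackets
        global_avg = self.total / self.count      # inside the brackets
        return f"{median:.4f} ({global_avg:.4f})"

# One huge spike dominates the global average but barely moves the median:
v = SmoothedValue()
for x in [2.0] * 50 + [1e9] + [2.0] * 50:
    v.update(x)
print(v)  # -> 2.0000 (9900992.0792)
```

If that is right, the value outside the brackets staying near 2 while the bracketed average explodes means a handful of enormous keypoint-loss spikes during the epoch, not a uniformly bad loss.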

Even around epoch 200 the averages still blow up like this. I guessed the frozen backbone layers might be the problem, so I set the backbone trainable layers to 2 and lowered the learning rate to 0.01. However, that run got to about the 50th epoch, then the loss became NaN and training stopped. Does anyone know the reason, and why only "resnext50_32x4d" is affected? If the spikes really are exploding gradients, I assume clipping them in the training step would at least keep the run from dying; a sketch of what I mean (the "max_norm" value is just a guess):
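```python
import torch

def train_step(model, optimizer, images, targets, max_norm=10.0):
    # Standard torchvision-style detection step: in train mode the
    # model returns a dict of losses.
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    losses.backward()
    # Clip the global gradient norm so a single huge keypoint-loss
    # spike cannot push the weights toward NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return losses.item()
```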