Hello,
I am interested in the Keypoint R-CNN function in PyTorch from here. As I understand it, it is based on Mask R-CNN and predicts keypoints of the human body. The link above shows how to combine a “resnet50” backbone with an FPN to build the Keypoint R-CNN keypoint detector. I tried to swap in other backbones via the function `resnet_fpn_backbone()`. I have tried “resnet18”, “wide_resnet50_2”, “wide_resnet101_2”, and “resnext50_32x4d”. In each case I change only the backbone and use its pre-trained weights. My training plan is below:
- learning rate: 0.02
- learning rate schedule: reduce by a factor of 10 every 60 epochs
- backbone trainable layers: 1
- backbone: pre-trained weights
- train for 200 epochs
According to this plan, I train the keypoint head, the keypoint predictor, and layer 4 of the backbone. “resnet18”, “wide_resnet50_2”, and “wide_resnet101_2” all work: the loss goes down and the inference results look good. However, I have a problem with “resnext50_32x4d”: the training loss is always very large. Part of the training metrics is shown below:
Epoch: [198] [ 0/415] eta: 0:57:42 lr: 0.000020 loss: 2.1488 (2.1488) loss_classifier: 0.0365 (0.0365) loss_box_reg: 0.0660 (0.0660) loss_keypoint: 2.0387 (2.0387) loss_objectness: 0.0062 (0.0062) loss_rpn_box_reg: 0.0014 (0.0014) backbone_lr: 0.0000 (0.0000) time: 8.3441 data: 7.0423 max mem: 0
Epoch: [198] [400/415] eta: 0:00:17 lr: 0.000020 loss: 2.0049 (62287665.0680) loss_classifier: 0.0461 (0.0554) loss_box_reg: 0.0668 (0.0695) loss_keypoint: 1.8585 (62287477.6360) loss_objectness: 0.0057 (62.1232) loss_rpn_box_reg: 0.0041 (125.3357) backbone_lr: 0.0000 (0.0000) time: 1.1544 data: 0.0172 max mem: 0
Epoch: [198] [414/415] eta: 0:00:01 lr: 0.000020 loss: 2.3009 (60186394.5416) loss_classifier: 0.0575 (0.0555) loss_box_reg: 0.0803 (0.0699) loss_keypoint: 2.1628 (60186213.4271) loss_objectness: 0.0120 (60.0281) loss_rpn_box_reg: 0.0050 (121.1077) backbone_lr: 0.0000 (0.0000) time: 1.1668 data: 0.0181 max mem: 0
Epoch: [198] Total time: 0:08:04 (1.1677 s / it)
Validation: [ 0/100] eta: 0:07:19 loss: 2.4968 (2.4968) loss_classifier: 0.0300 (0.0300) loss_box_reg: 0.0487 (0.0487) loss_keypoint: 2.4162 (2.4162) loss_objectness: 0.0009 (0.0009) loss_rpn_box_reg: 0.0009 (0.0009) pixDist: 1.6672 (1.6672) model_time: 0.3858 (0.3858) time: 4.3959 data: 3.9930 max mem: 0
Validation: [ 99/100] eta: 0:00:00 loss: 1.9306 (2.4355) loss_classifier: 0.0234 (0.0530) loss_box_reg: 0.0307 (0.0636) loss_keypoint: 1.8603 (2.2977) loss_objectness: 0.0025 (0.0165) loss_rpn_box_reg: 0.0029 (0.0048) pixDist: 5.9226 (18.6548) model_time: 0.2072 (0.2299) time: 0.2167 data: 0.0010 max mem: 0
Validation: Total time: 0:00:28 (0.2815 s / it)
Averaged stats: loss: 1.9306 (2.4355) loss_classifier: 0.0234 (0.0530) loss_box_reg: 0.0307 (0.0636) loss_keypoint: 1.8603 (2.2977) loss_objectness: 0.0025 (0.0165) loss_rpn_box_reg: 0.0029 (0.0048) pixDist: 5.9226 (18.6548) model_time: 0.2072 (0.2299)
Epoch: [199] [ 0/415] eta: 0:52:42 lr: 0.000020 loss: 1.6407 (1.6407) loss_classifier: 0.0340 (0.0340) loss_box_reg: 0.0583 (0.0583) loss_keypoint: 1.5362 (1.5362) loss_objectness: 0.0035 (0.0035) loss_rpn_box_reg: 0.0086 (0.0086) backbone_lr: 0.0000 (0.0000) time: 7.6195 data: 6.4353 max mem: 0
Epoch: [199] [400/415] eta: 0:00:17 lr: 0.000020 loss: 2.2163 (49125692.3149) loss_classifier: 0.0541 (933653.1231) loss_box_reg: 0.0530 (0.0693) loss_keypoint: 2.0583 (48192017.1261) loss_objectness: 0.0131 (11.7248) loss_rpn_box_reg: 0.0035 (10.2920) backbone_lr: 0.0000 (0.0000) time: 1.1280 data: 0.0092 max mem: 0
Epoch: [199] [414/415] eta: 0:00:01 lr: 0.000020 loss: 2.1601 (47468440.1186) loss_classifier: 0.0441 (902156.3929) loss_box_reg: 0.0660 (0.0693) loss_keypoint: 2.0583 (46566262.4014) loss_objectness: 0.0084 (11.3297) loss_rpn_box_reg: 0.0036 (9.9450) backbone_lr: 0.0000 (0.0000) time: 1.1506 data: 0.0082 max mem: 0
Epoch: [199] Total time: 0:08:02 (1.1627 s / it)
Validation: [ 0/100] eta: 0:07:16 loss: 2.3626 (2.3626) loss_classifier: 0.0189 (0.0189) loss_box_reg: 0.0500 (0.0500) loss_keypoint: 2.2917 (2.2917) loss_objectness: 0.0015 (0.0015) loss_rpn_box_reg: 0.0005 (0.0005) pixDist: 1.6832 (1.6832) model_time: 0.3154 (0.3154) time: 4.3678 data: 4.0433 max mem: 0
Validation: [ 99/100] eta: 0:00:00 loss: 1.5503 (2.5065) loss_classifier: 0.0220 (0.0534) loss_box_reg: 0.0319 (0.0631) loss_keypoint: 1.4727 (2.3703) loss_objectness: 0.0034 (0.0160) loss_rpn_box_reg: 0.0029 (0.0037) pixDist: 5.9424 (17.9660) model_time: 0.1999 (0.2264) time: 0.2159 data: 0.0010 max mem: 0
Validation: Total time: 0:00:27 (0.2777 s / it)
Averaged stats: loss: 1.5503 (2.5065) loss_classifier: 0.0220 (0.0534) loss_box_reg: 0.0319 (0.0631) loss_keypoint: 1.4727 (2.3703) loss_objectness: 0.0034 (0.0160) loss_rpn_box_reg: 0.0029 (0.0037) pixDist: 5.9424 (17.9660) model_time: 0.1999 (0.2264)
It can be seen that the training loss takes very large values; for example, at “Epoch: [198] [414/415]” the loss is “2.3009 (60186394.5416)”. I am not sure what the numbers inside and outside the brackets mean. My guess is that the one outside the brackets, “2.3009”, is the current loss, and the number inside the brackets, “(60186394.5416)”, is an average or maximum value. Even after about 200 epochs the large value is still there. I suspected that the frozen backbone layers might be causing the problem, so I set “backbone trainable layers” to 2 and used a learning rate of 0.01. However, the Keypoint R-CNN then trains until around the 50th epoch, the loss becomes NaN, and training stops. Does anyone know the reason?
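To make my question about the two numbers concrete, here is a minimal sketch of how I believe the `SmoothedValue` logger in torchvision’s detection reference scripts (`references/detection/utils.py`) produces them; the window size of 20 and the “median (global average)” format are my reading of that file, so please correct me if I am wrong:

```python
from collections import deque


class SmoothedValue:
    """Minimal re-implementation of the smoothed logger used (I believe)
    by the torchvision detection reference scripts: the printed line
    shows 'median_of_recent_window (global_average)'."""

    def __init__(self, window_size=20):
        self.window = deque(maxlen=window_size)  # recent values only
        self.total = 0.0                         # sum over the whole epoch
        self.count = 0

    def update(self, value):
        self.window.append(value)
        self.total += value
        self.count += 1

    @property
    def median(self):
        s = sorted(self.window)
        return s[len(s) // 2]

    @property
    def global_avg(self):
        return self.total / self.count

    def __str__(self):
        return f"{self.median:.4f} ({self.global_avg:.4f})"


# One exploding iteration dominates the global average for the rest of
# the epoch, while the median recovers immediately once the spike
# leaves the window:
m = SmoothedValue()
for v in [2.0] * 30 + [6e8] + [2.0] * 30:
    m.update(v)
print(m)  # median is back to ~2, the bracketed average stays huge
```

If this reading is right, a single exploding iteration would explain why the bracketed number stays around 6e7 for the whole epoch while the number outside the brackets looks normal.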