Torchvision Mask R-CNN with ResNeXt101 backbone produces NaN loss during training

Hi!
When I train Mask R-CNN with a ResNeXt101 backbone, the loss goes to NaN.
My code is based on the PyTorch tutorial.

from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.mask_rcnn import MaskRCNN, MaskRCNNPredictor
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Build a ResNeXt101-FPN backbone with all layers trainable.
backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True, trainable_layers=5)
model = MaskRCNN(backbone, num_classes)

# Replace the box predictor for the custom number of classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask predictor as well.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                   hidden_layer,
                                                   num_classes)

Error message:

Epoch: [0]  [   0/1388]  eta: 0:23:07  lr: 0.000010  loss: 2.0952 (2.0952)  loss_classifier: 0.6831 (0.6831)  loss_box_reg: 0.0190 (0.0190)  loss_mask: 0.6889 (0.6889)  loss_objectness: 0.6918 (0.6918)  loss_rpn_box_reg: 0.0123 (0.0123)  time: 0.9999  data: 0.3882  max mem: 0
Loss is nan, stopping training
{'loss_classifier': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>), 'loss_mask': tensor(nan, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_objectness': tensor(0.6931, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(0.0050, device='cuda:1', grad_fn=<DivBackward0>)}

When I change the backbone to ResNet50 or ResNet101, no error occurs.

I tried reducing the learning rate, but the error still occurs with ResNeXt-like backbones.
How can I solve it?

You could check the forward activations for invalid values via forward hooks as described here. Once you’ve isolated which layer creates the NaN outputs, check its inputs as well as its parameters.
If the parameters show invalid values, most likely the gradients were too large, the model was diverging, and the parameters were overflowing. On the other hand, if the inputs contain NaNs, check the previous operation and see if/how it could create invalid values.
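
For example, something along these lines (a minimal sketch; the helper name make_nan_hook is just illustrative, not part of torchvision) would flag the first module that produces non-finite outputs:

import torch

def make_nan_hook(name):
    def hook(module, inputs, output):
        # Only check plain tensor outputs; some modules return tuples or dicts.
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output in {name}")
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_nan_hook(name))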

As you suggested, I checked the outputs of the layers, but I didn’t find any invalid values. Instead, I found that the loss is a very large value in the first batch of the first epoch:

{'loss_classifier': tensor(2.4077e+09, grad_fn=<NllLossBackward>),
 'loss_box_reg': tensor(2.9896e+09, grad_fn=<DivBackward0>),
 'loss_mask': tensor(8.4899e+10, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
 'loss_objectness': tensor(2.1843e+09, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
 'loss_rpn_box_reg': tensor(1.9489e+08, grad_fn=<DivBackward0>)}

After the update, in the second batch, the losses go to NaN.

Could this be a clue for solving the problem?

P.S. I use the SGD optimizer with lr = 0.0005 and a warm-up at the start.
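
For reference, the optimizer setup looks roughly like this (a sketch; the momentum, weight decay, and warm-up values are illustrative, following the torchvision detection reference scripts):

import torch

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.0005, momentum=0.9, weight_decay=0.0005)

# Linear warm-up over the first iterations of epoch 0 (values are illustrative).
warmup_iters = 1000
warmup_factor = 1.0 / 1000

def warmup_lambda(it):
    if it >= warmup_iters:
        return 1.0
    alpha = float(it) / warmup_iters
    return warmup_factor * (1 - alpha) + alpha

warmup_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)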

Your model seems to be diverging. Since the forward activations seem to be in an expected range, you could check the loss function, which seems to blow up the values.

Thanks for your advice. But I still don’t understand why Mask R-CNN with a ResNet backbone is fine while the ResNeXt version diverges. The NaN always occurs with any dataset, including COCO2017, whenever ResNeXt or Wide ResNet is used as the backbone. I just built the backbone with vision/torchvision/models/detection/backbone_utils.py and swapped it into the default Mask R-CNN provided by torchvision. Is this a plausible situation?

I don’t know why the change of backbone would cause this issue, but based on your last post:

it seems that the output activations of the backbone look alright, while the loss is really high (or were the output activations already high?).
In the latter case, I would guess that your current hyperparameters are not suitable for the new backbone and are letting the model diverge.
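
For example, something like this (a rough sketch; images stands in for an already batched and normalized input tensor, which is not shown in your post) would show whether the FPN feature maps coming out of the backbone are unusually large:

import torch

with torch.no_grad():
    features = model.backbone(images)  # OrderedDict of FPN feature maps
    for level, fmap in features.items():
        print(level, fmap.abs().max().item())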