Getting NaN loss when using wide_resnet_fpn or resnext_fpn backbone with FasterRCNN

Sahil_Goyal · June 24, 2021, 5:29pm

The NaN loss appears only when I use wide_resnet_fpn or resnext_fpn as the backbone; the classic ResNets with FPN work properly as backbones in FasterRCNN. However, torchvision states that all of them can be used in the model below. Any idea?

Error

Epoch: [0]  [  0/457]  eta: 0:26:59  lr: 0.000032  loss: 2.0617 (2.0617)  loss_classifier: 0.6785 (0.6785)  loss_box_reg: 0.5333 (0.5333)  loss_objectness: 0.6902 (0.6902)  loss_rpn_box_reg: 0.1597 (0.1597)  time: 3.5431  data: 0.6946  max mem: 6346
Loss is nan, stopping training
{'loss_classifier': tensor(nan, device='cuda:2', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:2', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:2', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:2', grad_fn=<DivBackward0>)}
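To narrow down which loss component goes non-finite first, a check along these lines could be called on the loss dict at each iteration. This is only a minimal sketch: the helper name `find_nonfinite_losses` is hypothetical, and it assumes each value is a plain float or something `float()` can convert, such as a single-element tensor.

```python
import math


def find_nonfinite_losses(loss_dict):
    """Return the names of loss components that are NaN or infinite.

    Hypothetical helper, not part of torchvision. Each value must be
    convertible with float(), e.g. a Python float or a 0-dim tensor.
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Illustrative values only, not taken from the run above.
example = {"loss_classifier": float("nan"), "loss_box_reg": 0.5333}
bad_components = find_nonfinite_losses(example)
```

In the log above all four components are already NaN, so a check like this is mainly useful when run on every iteration, to catch the first step where only some of them have diverged.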

Model

from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection import FasterRCNN

def FRCNN_resnetfpn_backbone(backbone_name='resnet101', pre_trained=True):

    # Reference: https://github.com/pytorch/vision/blob/master/torchvision/models/detection/backbone_utils.py

    backbone = resnet_fpn_backbone(backbone_name, pre_trained)
    """
    resnet_fpn_backbone:
    Args:
        backbone_name (string): resnet architecture. Possible values are 'ResNet', 'resnet18', 'resnet34', 'resnet50',
             'resnet101', 'resnet152', 'resnext50_32x4d', 'resnext101_32x8d', 'wide_resnet50_2', 'wide_resnet101_2'
        pretrained (bool): If True, returns a model with backbone pre-trained on Imagenet
        norm_layer (torchvision.ops): it is recommended to use the default value. For details visit:
            (https://github.com/facebookresearch/maskrcnn-benchmark/issues/267)
        trainable_layers (int): number of trainable (not frozen) resnet layers starting from final block.
            Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable. default=3
        returned_layers (list of int): The layers of the network to return. Each entry must be in ``[1, 4]``.
            By default all layers are returned.
        extra_blocks (ExtraFPNBlock or None): if provided, extra operations will
            be performed. It is expected to take the fpn features, the original
            features and the names of the original features as input, and returns
            a new list of feature maps and their corresponding names. By
            default a ``LastLevelMaxPool`` is used.
    """
    model = FasterRCNN(backbone,
                       num_classes=2)
    return model

Thanks a lot for any help!

Sahil_Goyal · June 25, 2021, 1:59pm

Actually, it turned out to be related to the torchvision version. I was using 0.8.0a0 earlier; changing to 0.9.0a0 resolved the issue.
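A small guard like the one below could flag the older version before training starts. It is a sketch under assumptions: the helper names `version_tuple` and `is_at_least` are hypothetical, and the (0, 9) threshold simply reflects the versions reported in this thread, not a documented fix.

```python
import re


def version_tuple(version):
    """Extract (major, minor) from a version string such as '0.8.0a0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(version, minimum=(0, 9)):
    """True if the version is at or above `minimum`.

    The default (0, 9) is an assumption based on this thread only.
    """
    return version_tuple(version) >= minimum
```

With torchvision installed, this could be used as `is_at_least(torchvision.__version__)`, printing a warning when it returns False.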