You could check the forward activations for invalid values via forward hooks as described here. Once you've isolated which layer creates the NaN outputs, check its inputs as well as its parameters.
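A minimal sketch of that hook-based check, using a small stand-in model (the hook function name and the toy `nn.Sequential` are just for illustration, not from your setup):

```python
import torch
import torch.nn as nn

def nan_hook(module, inputs, output):
    # flag the first module whose output contains NaN or Inf values
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        raise RuntimeError(
            f"Invalid values detected in the output of {module.__class__.__name__}"
        )

# toy stand-in for the real model
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
for module in model.modules():
    module.register_forward_hook(nan_hook)

# raises as soon as any layer produces NaN/Inf in its output
out = model(torch.randn(2, 4))
```

Note that some modules return tuples rather than tensors, so for a real detection model you may need to unpack the hook's `output` argument before checking it.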
If the parameters show invalid values, most likely the gradients were too large, the model was diverging, and the parameters overflowed. On the other hand, if the inputs contain NaNs, check the previous operation and see if/how it could create invalid values.
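The parameter and gradient check can look like this sketch (the `nn.Linear` is a stand-in for your real model):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model
loss = model(torch.randn(3, 4)).sum()
loss.backward()

# scan all parameters and their gradients for NaN/Inf,
# e.g. after a suspicious update step
for name, param in model.named_parameters():
    if not torch.isfinite(param).all():
        print(f"invalid values in parameter {name}")
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"invalid values in gradient of {name}")
```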
As you suggested, I checked the outputs of the layers, but I didn't find any invalid values. Instead, I found that the loss is a very large value on the first epoch, first batch:
Your model seems to be diverging. Since the forward activations seem to be in an expected range, you could check the loss function, which seems to blow up the values.
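One way to guard the loss in the training loop, sketched with a generic criterion and random stand-in tensors (the `1e4` threshold is an arbitrary example, tune it to your loss scale):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
output = torch.randn(8, 5)             # stand-in for model outputs
target = torch.randint(0, 5, (8,))     # stand-in for targets

loss = criterion(output, target)
# stop before a blown-up loss corrupts the parameters via backward()
if not torch.isfinite(loss) or loss.item() > 1e4:
    raise RuntimeError(f"suspicious loss value: {loss.item()}")
```

You could also wrap the iteration in `torch.autograd.detect_anomaly()`, which will point at the forward operation that produced a NaN gradient, at the cost of a slower backward pass.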
Thanks for your advice. But I still don't understand why Mask R-CNN with a ResNet backbone is fine while the same model with a ResNeXt backbone produces invalid values. The NaN always occurs, with any dataset including COCO2017, when ResNeXt or Wide ResNet is used as the backbone. I just built the backbone using vision/torchvision/models/detection/backbone_utils.py and swapped it into the default Mask R-CNN provided by torchvision. Is this a possible situation?
I don’t know why the change in backbone would cause this issue, but based on your last post:
it seems that the output activations of the backbone look alright, while the loss is really high (or were the output activations already high?).
In the latter case, I would guess that your current hyperparameters are not suitable for the new backbone and are letting the model diverge.
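Typical first steps in that situation are a lower learning rate and gradient clipping. A sketch, again with a toy stand-in model (the concrete `lr` and `max_norm` values are just example numbers):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the detection model
# try a smaller learning rate than the one used for the ResNet backbone
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
# clip the gradient norm before the update so early steps cannot explode
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```

A short learning-rate warmup over the first few hundred iterations often helps as well, since detection losses tend to be largest right at the start of training.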