Hi.
Yes. The gradient zeroing is done in the mmcv package, in the `after_train_iter` call:
see https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/optimizer.py
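For reference, the relevant hook looks roughly like this (paraphrased from the linked file; the exact code varies across mmcv versions, so treat this as a sketch rather than the verbatim source):

```python
from torch.nn.utils import clip_grad
from mmcv.runner import Hook

class OptimizerHook(Hook):
    def __init__(self, grad_clip=None):
        self.grad_clip = grad_clip

    def after_train_iter(self, runner):
        # gradients are zeroed here, right before the backward pass
        runner.optimizer.zero_grad()
        runner.outputs['loss'].backward()
        if self.grad_clip is not None:
            # optional gradient clipping on trainable parameters
            clip_grad.clip_grad_norm_(
                filter(lambda p: p.requires_grad,
                       runner.model.parameters()),
                **self.grad_clip)
        runner.optimizer.step()
```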
The learning rate and optimizer settings are also fine. As mentioned, training with the same config in the TensorFlow implementation does not produce NaNs.
PS: we tried gradient clipping; it did not help.
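For context, clipping was enabled through the standard mmdetection `optimizer_config` knob; the `max_norm`/`norm_type` values below are the common defaults from the example configs, shown for illustration and not necessarily the exact values we used:

```python
# mmdetection-style config snippet; values are the commonly used
# defaults from the example configs, for illustration only
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```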
Anyway, the problem is not easy to reproduce, and the behavior is not identical across repros. E.g., in another repro the value does grow larger and larger, but not as large as the 1.8035e+25 in the screenshot above, before the NaN occurs.
Any suggestions?
Should we print tensor gradients during the backward pass using hooks? And what would we do next if we find anomalous gradients somewhere… I would assume the faster-rcnn_fpn implementation is already stable in mmdetection…
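If it helps, here is a minimal sketch of the hook idea in plain PyTorch; `model` stands for whatever detector instance gets built, and the helper name and the 1e8 threshold are just assumptions for illustration:

```python
import torch

def make_grad_checker(name):
    """Return a backward hook that flags non-finite or exploding gradients."""
    def hook(grad):
        if not torch.isfinite(grad).all():
            print(f'[NaN/Inf grad] {name}')
        elif grad.abs().max() > 1e8:  # arbitrary "anomalous" threshold
            print(f'[large grad] {name}: max={grad.abs().max().item():.3e}')
        return grad
    return hook

# `model` is assumed to be the built detector (hypothetical name here)
for name, param in model.named_parameters():
    if param.requires_grad:
        param.register_hook(make_grad_checker(name))
```

As a cheaper first step, `torch.autograd.set_detect_anomaly(True)` can also pinpoint the operation that first produces a NaN in the backward pass, at the cost of noticeably slower iterations.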