Question about PyTorch NaN

  1. Can you list which operations can cause NaN in the forward and backward pass (e.g. inf - inf)?
  2. How can NaN be detected and avoided (e.g. by adding an if check in the code, or by using detect_anomaly)?
  1. Quite a lot of operations can produce NaN. In short, whenever the result of an operation is mathematically indeterminate, it will probably be NaN, e.g.:
```
inf - inf       -> nan
(-inf) - (-inf) -> nan
inf / inf       -> nan
0. / 0.         -> nan

# inf plus inf is still inf
inf + inf -> inf
# inf plus any finite number is still inf
inf + 1 -> inf
# inf divided by zero is still inf
inf / 0 -> inf
```
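These rules are easy to verify in PyTorch itself; a minimal sketch:

```python
import torch

inf = torch.tensor(float('inf'))
zero = torch.tensor(0.)

# Indeterminate forms give nan:
print(inf - inf)         # tensor(nan)
print((-inf) - (-inf))   # tensor(nan)
print(inf / inf)         # tensor(nan)
print(zero / zero)       # tensor(nan)

# Determinate forms stay inf:
print(inf + inf)         # tensor(inf)
print(inf + 1)           # tensor(inf)
print(inf / 0)           # tensor(inf)
```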
  2. Enabling detect_anomaly() is fine while debugging, but it causes noticeable performance degradation, so don't leave it on for real training runs. torch.isnan() can be used to check whether a tensor contains NaN values; see the sketch after this item.
    Avoiding NaN is a rather complex topic. NaN can be caused by problematic data, numerical error (often in mixed-precision training), or code-related bugs. To avoid it, you should first find out why it occurs.
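A minimal sketch of both detection approaches (the 0/0 below is contrived just to force a NaN gradient):

```python
import torch

# torch.isnan: elementwise check; .any() gives a cheap whole-tensor flag.
t = torch.tensor([1.0, float('nan'), float('inf')])
print(torch.isnan(t))        # tensor([False,  True, False])
print(torch.isnan(t).any())  # tensor(True)

# Anomaly mode makes backward() raise as soon as a gradient turns NaN,
# with a traceback pointing at the forward op responsible. It adds real
# overhead, so keep it to debugging runs.
try:
    with torch.autograd.detect_anomaly():
        x = torch.zeros(1, requires_grad=True)
        y = x / x            # 0 / 0 -> nan already in the forward pass
        y.backward()
except RuntimeError as err:
    print(err)               # names the offending op, e.g. DivBackward0
```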
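As a concrete example of the "find the cause first" advice: one very common numerical culprit is taking the log of (or dividing by) a value that can reach zero. Adding a small epsilon or clamping usually keeps things finite; eps here is a hypothetical constant you would tune for your dtype and value range:

```python
import torch

eps = 1e-8  # hypothetical constant; tune for your dtype and value range

x = torch.tensor([0.0, 0.5, 1.0])

print(torch.log(x))                 # tensor([-inf, -0.6931, 0.]); the -inf
                                    # can turn into nan further downstream
print(torch.log(x + eps))           # finite everywhere
print(torch.log(x.clamp(min=eps)))  # alternative: clamp the input instead
```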