Question about PyTorch NaN

  1. Can you list which operations can cause NaN in the forward and backward pass (e.g. inf - inf)?
  2. How can NaN be detected and avoided (e.g. by adding an if check in the code, or by using detect_anomaly)?
  1. Quite a lot of operations can produce NaN. In short, whenever the result of an operation is mathematically indeterminate, it will probably be NaN, e.g.:
```
inf - inf       -> nan
(-inf) - (-inf) -> nan
inf / inf       -> nan
0. / 0.         -> nan

# inf plus inf is still inf
inf + inf -> inf
# inf plus any finite number is still inf
inf + 1 -> inf
# inf divided by zero is still inf
inf / 0 -> inf
```
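These rules are easy to verify in PyTorch itself; a minimal sketch:

```python
import torch

inf = torch.tensor(float('inf'))
zero = torch.tensor(0.)

# Indeterminate forms give nan:
print(inf - inf)         # tensor(nan)
print((-inf) - (-inf))   # tensor(nan)
print(inf / inf)         # tensor(nan)
print(zero / zero)       # tensor(nan)

# Determinate forms stay inf:
print(inf + inf)         # tensor(inf)
print(inf + 1)           # tensor(inf)
print(inf / 0)           # tensor(inf)
```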
  2. Enabling detect_anomaly() is fine while debugging, but it causes noticeable performance degradation, so don't leave it on for real training runs. torch.isnan() can be used to check whether a tensor contains NaN values; see the sketch after this item.
    Avoiding NaN is a rather complex topic. NaN can be caused by problematic data, numerical error (often in mixed-precision training), or code-related bugs. To avoid it, you should first find out why it occurs.
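A minimal sketch of both detection approaches (the 0/0 below is contrived just to force a NaN gradient):

```python
import torch

# torch.isnan: elementwise check; .any() gives a cheap whole-tensor flag.
t = torch.tensor([1.0, float('nan'), float('inf')])
print(torch.isnan(t))        # tensor([False,  True, False])
print(torch.isnan(t).any())  # tensor(True)

# Anomaly mode makes backward() raise as soon as a gradient turns NaN,
# with a traceback pointing at the forward op responsible. It adds real
# overhead, so keep it to debugging runs.
try:
    with torch.autograd.detect_anomaly():
        x = torch.zeros(1, requires_grad=True)
        y = x / x            # 0 / 0 -> nan already in the forward pass
        y.backward()
except RuntimeError as err:
    print(err)               # names the offending op, e.g. DivBackward0
```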
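As a concrete example of the "find the cause first" advice: one very common numerical culprit is taking the log of (or dividing by) a value that can reach zero. Adding a small epsilon or clamping usually keeps things finite; eps here is a hypothetical constant you would tune for your dtype and value range:

```python
import torch

eps = 1e-8  # hypothetical constant; tune for your dtype and value range

x = torch.tensor([0.0, 0.5, 1.0])

print(torch.log(x))                 # tensor([-inf, -0.6931, 0.]); the -inf
                                    # can turn into nan further downstream
print(torch.log(x + eps))           # finite everywhere
print(torch.log(x.clamp(min=eps)))  # alternative: clamp the input instead
```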