Numerical error on A100 GPUs

  1. I don’t know which input data range you are using, but based on the errors I would guess they are caused by the TF32 numerical precision. Could you disable TF32 and recheck the results via:

```python
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```
  2. Could you share the model and describe your use case in a bit more detail, please? The assumption is that TF32 does not cause convergence issues, but your use case sounds concerning.

  3. See point 1. Disabling TF32 will cost some performance, but it might still be faster overall than dropping to an older PyTorch release (with older CUDA libraries).
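As a quick sanity check that doesn’t need GPU access, you can emulate TF32’s reduced mantissa in plain Python to see the magnitude of error it can introduce in a single multiply input. This is only a sketch: the `tf32_round` helper below is hypothetical (not a PyTorch or CUDA API) and models the mantissa truncation that TF32 applies to matmul inputs on Ampere GPUs.

```python
import struct

def tf32_round(x: float) -> float:
    """Emulate TF32 storage: FP32's 8-bit exponent, but only 10 explicit
    mantissa bits instead of FP32's 23. Zeroing the low 13 mantissa bits
    approximates the precision loss of TF32 matmul inputs."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 least-significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.2345678
# Relative error is bounded by about 2**-10 (~1e-3), versus ~1e-7 for FP32.
print(abs(tf32_round(x) - x) / x)
```

If the errors you observe are on the order of 1e-3 relative to an FP32 or FP64 reference, TF32 is the likely cause; errors far larger than that point to a different problem.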
