Numerical error on A100 GPUs

  1. I don’t know which input data range you are using, but based on the errors I would guess they are caused by the TF32 numerical precision. Could you disable TF32 and recheck the results via:

```python
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```
  2. Could you share the model and describe your use case in a bit more detail, please? The assumption is that TF32 does not cause convergence issues, but your use case sounds concerning.

  3. See point 1. Disabling TF32 will cost some performance, but it might still be faster overall than dropping to an older PyTorch release (with older CUDA libraries).
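As a quick sanity check that doesn’t need GPU access, you can emulate TF32’s reduced mantissa in plain Python to see the magnitude of error it can introduce in a single multiply input. This is only a sketch: the `tf32_round` helper below is hypothetical (not a PyTorch or CUDA API) and models the mantissa truncation that TF32 applies to matmul inputs on Ampere GPUs.

```python
import struct

def tf32_round(x: float) -> float:
    """Emulate TF32 storage: FP32's 8-bit exponent, but only 10 explicit
    mantissa bits instead of FP32's 23. Zeroing the low 13 mantissa bits
    approximates the precision loss of TF32 matmul inputs."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 least-significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.2345678
# Relative error is bounded by about 2**-10 (~1e-3), versus ~1e-7 for FP32.
print(abs(tf32_round(x) - x) / x)
```

If the errors you observe are on the order of 1e-3 relative to an FP32 or FP64 reference, TF32 is the likely cause; errors far larger than that point to a different problem.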
