NaN Gradient/Loss for Quadro RTX 8000 only

Hi all!

I am currently training different diffusion models with the [Imagen-pytorch] repository from Phil Wang. Training works fine on a colleague's Nvidia A6000 GPU, but on my Quadro RTX 8000 I get NaN losses caused by NaN gradients.

Setups I experimented with:
GPU: A6000
Nvidia Driver Version: 525.125.06
Cuda Version: 12.0
torch: 1.13.1
→ Everything works fine

GPU: Quadro RTX 8000
Nvidia Driver Version: 525.125.06
Cuda Version: 12.0
torch: 1.13.1
→ NaN Losses/ Gradients

GPU: Quadro RTX 8000
Nvidia Driver Version: 470.223.02
Cuda Version: 11.4
torch: 1.12.1
→ NaN Losses/ Gradients

Other things I tried:
Following the [pytorch docs], I suspected that something was wrong with AMP and GradScaler, so I disabled both in the codebase. The NaN losses still persist.
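For reference, switching AMP and the GradScaler off together can be sketched roughly as below. This is only a minimal illustration on a toy model, not the imagen-pytorch training loop; `train_step` and the `use_amp` flag are hypothetical names introduced here.

```python
import torch
from torch import nn


def train_step(model, batch, target, optimizer, scaler, use_amp):
    """One optimization step; autocast and loss scaling follow `use_amp`."""
    optimizer.zero_grad(set_to_none=True)
    # With enabled=False the forward pass runs in the tensors' own dtype (fp32 here).
    with torch.autocast(device_type=batch.device.type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(batch), target)
    # A disabled GradScaler returns the loss unscaled and step() simply
    # calls optimizer.step().
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


use_amp = False  # hypothetical switch: False means plain fp32 training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)  # toy stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 4, device=device)
y = torch.randn(8, 2, device=device)
loss_value = train_step(model, x, y, optimizer, scaler, use_amp)
print(loss_value)
```

If the NaNs survive with `use_amp = False`, mixed precision is probably not the only cause, which matches what I observed.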

Using torch.autograd.set_detect_anomaly(True) the following error is thrown:

RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.
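A rough sketch of how anomaly detection can be combined with a per-parameter gradient check to narrow down where the non-finite values show up. The toy model and the helper `nonfinite_grad_names` are made up for illustration and are not part of imagen-pytorch.

```python
import torch
from torch import nn


def nonfinite_grad_names(model):
    """Return the names of parameters whose gradients contain NaN or Inf."""
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]


model = nn.Linear(4, 2)  # toy stand-in for the real network
x = torch.randn(8, 4)

# Anomaly detection makes backward() raise as soon as a backward function
# produces NaN, and reports the forward call that created it.
with torch.autograd.set_detect_anomaly(True):
    loss = model(x).pow(2).mean()
    loss.backward()

# After a backward pass that did not raise, list any parameters with bad gradients.
bad = nonfinite_grad_names(model)
print(bad)
```

Anomaly detection slows training down noticeably, so it is best enabled only while debugging.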

At this point I am not sure why this is hardware-specific and only happens on the Quadro RTX 8000 GPU. Maybe someone can shed some light :sun_with_face:

(Automatic Mixed Precision — PyTorch Tutorials 2.2.0+cu121 documentation)
(GitHub - lucidrains/imagen-pytorch: Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch)

All of the PyTorch binaries you used are old by now, so please update to the latest stable or nightly release and rerun the tests.

I updated the complete setup to have:

GPU: Quadro RTX 8000
Nvidia Driver Version: 545.29.06
Cuda Version: 12.3
torch: 2.1.2

And now it works. I never found out what the exact cause was, though. Still, thanks for your help :blush:
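In case it helps anyone comparing setups like the ones listed in this thread: the CUDA version reported by nvidia-smi is the highest version the driver supports, which is not necessarily the CUDA runtime that the installed PyTorch binary was built with. A small sketch for recording what PyTorch itself sees (the helper name `environment_report` is made up here):

```python
import torch


def environment_report():
    """Collect the versions PyTorch itself reports, for comparing machines."""
    report = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


report = environment_report()
print(report)
```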