Hi all!
I am currently training diffusion models with the [Imagen-pytorch] repository from Phil Wang, which works fine when trained on a colleague's Nvidia A6000 GPU. When I train on my Quadro RTX 8000, I get NaN losses caused by NaN gradients.
Setups I experimented with:

| GPU             | Nvidia driver | CUDA | torch  | Result                |
|-----------------|---------------|------|--------|-----------------------|
| A6000           | 525.125.06    | 12.0 | 1.13.1 | everything works fine |
| Quadro RTX 8000 | 525.125.06    | 12.0 | 1.13.1 | NaN losses/gradients  |
| Quadro RTX 8000 | 470.223.02    | 11.4 | 1.12.1 | NaN losses/gradients  |
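For completeness, this is the kind of snippet I use to record each setup (plain PyTorch calls, nothing specific to imagen-pytorch):

```python
import torch

# Record the environment for each setup
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```

The two cards also differ in compute capability (the A6000 is Ampere, the Quadro RTX 8000 is Turing), which is the main hardware difference between the working and failing setups that I can point to so far.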
Other things I tried:
As described in the [pytorch docs], I suspected a problem with AMP and the GradScaler, so I disabled both in the codebase (roughly as in the first sketch below). The NaN losses still persist.
With torch.autograd.set_detect_anomaly(True) enabled, the following error is thrown:
RuntimeError: Function 'MmBackward0' returned nan values in its 0th output.
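For context, this is roughly how I disabled AMP and the GradScaler, sketched as a generic training step rather than the actual imagen-pytorch code (model, optimizer, images and texts are placeholders):

```python
import torch

# With enabled=False both autocast and GradScaler become no-ops,
# so the whole step runs in full fp32 precision.
USE_AMP = False
scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)

def train_step(model, optimizer, images, texts):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=USE_AMP):
        loss = model(images, texts=texts)  # placeholder forward call
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```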
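To narrow down where the NaNs first show up, I also used checks along these lines (a minimal sketch; the module and parameter names obviously differ in the actual U-Net):

```python
import torch

def install_nan_hooks(model):
    # Flag the first module whose forward output contains NaN/Inf,
    # to localize the problem before the backward error is raised.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite forward output in {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

def report_nan_grads(model):
    # Call after loss.backward() to list parameters with non-finite grads.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")
```

The anomaly trace points at a matmul, and these hooks help map that back to a named layer.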
At this point I am not sure why this is hardware specific and only happens on the Quadro RTX 8000 GPU. Maybe someone can shed some light on this.
[Imagen-pytorch]: GitHub - lucidrains/imagen-pytorch: Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
[pytorch docs]: Automatic Mixed Precision — PyTorch Tutorials 2.2.0+cu121 documentation