In PyTorch 2.7 + CUDA 12.8, using AMP causes scaler.step() to throw an error saying that no inf checks were recorded. Why does this happen, and how can I solve it?
The exact same code works fine on older versions of PyTorch and CUDA. However, after switching to an RTX 5090 and upgrading to PyTorch 2.7 + CUDA 12.8, I found that this code:
self.scaler = torch.amp.GradScaler(enabled=self.optim_conf.amp.enabled)
self.scaler.step(optim.optimizer)
now throws the following error:
[rank0]: AssertionError: No inf checks were recorded for this optimizer.