Model converges even when gradients are inf

I am training a U-Net for image segmentation using Lightning.
In training_step I check the gradients via

        for name, param in self.model.named_parameters():
            # Report any parameter whose gradient contains inf or NaN values
            if param.grad is not None and not torch.isfinite(param.grad).all():
                print(f" ------ gradient of model param {name} is finite={torch.isfinite(param.grad).all()}, has nan={torch.isnan(param.grad).any()} ------ ")

and I am seeing invalid gradients right at the beginning of training. However, they disappear after a few iterations and the model continues to converge. I also enabled detect_anomaly, but the invalid gradients are not detected. For context, I am doing fp16 training. With fp32 there seem to be fewer occurrences of invalid gradients.
I am wondering why the invalid gradients didn’t affect the training at all…

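I assume you are using mixed-precision training via torch.amp and thus also the GradScaler?
If so, note that you might be checking the scaled gradients, which can overflow in fp16. When that happens, the GradScaler skips the parameter update for that iteration and decreases the scaling factor, as described in the docs. The weights are therefore never updated with the invalid gradients, which would explain why training still converges.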
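
To illustrate the skip-and-back-off behavior, here is a rough pure-Python sketch of dynamic loss scaling. It is a toy model, not the actual GradScaler implementation: `to_fp16`, `ToyScaler`, and `simulate` are hypothetical names made up for this example, the fp16 cast is only approximated by clamping to inf, and the default values are assumed to mirror the documented GradScaler defaults (init_scale=65536, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000).

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16(x):
    """Crude stand-in for a float16 cast: out-of-range values become inf."""
    if math.isfinite(x) and abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


class ToyScaler:
    """Hypothetical, simplified imitation of dynamic loss scaling."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, weight, true_grad, lr):
        """Return (new_weight, applied) for one optimizer step."""
        # Backward on the scaled loss yields a scaled gradient, which can overflow.
        scaled_grad = to_fp16(true_grad * self.scale)
        if not math.isfinite(scaled_grad):
            # Overflow: skip the update and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return weight, False
        # Finite gradient: unscale it and apply a plain SGD update.
        grad = scaled_grad / self.scale
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self.good_steps = 0
        return weight - lr * grad, True


def simulate(steps=10, weight=1.0, lr=0.1):
    """Minimize loss = 2 * w**2 (gradient 4 * w); return (weight, scale, skipped)."""
    scaler = ToyScaler()
    skipped = 0
    for _ in range(steps):
        weight, applied = scaler.step(weight, 4.0 * weight, lr)
        if not applied:
            skipped += 1
    return weight, scaler.scale, skipped


if __name__ == "__main__":
    final_weight, final_scale, skipped = simulate()
    print(f"weight={final_weight:.4f} scale={final_scale} skipped={skipped}")
```

In this toy run the first few steps see an inf scaled gradient and are skipped while the scale is halved each time. Once the scaled gradient fits into the fp16 range, the updates go through and the weight moves toward the minimum, which matches the pattern of invalid gradients early on followed by normal convergence.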