I am training a U-Net for image segmentation using Lightning. In training_step I check the gradients via:

```python
# inside training_step, flag any parameter whose gradient is non-finite
for name, param in self.model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        # check the gradient itself (not the parameter) for NaNs
        log.info(f" ------ gradient of model param {name} is non-finite, has nan={torch.isnan(param.grad).any()} ------ ")
```
With this check I am seeing invalid gradients right at the beginning of training. However, they disappear after a few iterations and the model continues to converge. I also enabled detect_anomaly, but the invalid gradients are not detected.
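For reference, this is roughly how I enable it; the import path assumes Lightning 2.x, and the last line is the plain-PyTorch equivalent:

```python
import torch
import lightning.pytorch as pl  # assuming the Lightning 2.x import path

# Trainer flag that wraps the run in torch.autograd.detect_anomaly()
trainer = pl.Trainer(detect_anomaly=True)

# plain-PyTorch equivalent, outside of Lightning
torch.autograd.set_detect_anomaly(True)
```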
For context, I am doing fp16 training; with fp32 there seem to be fewer occurrences of invalid gradients.
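This is how the fp16 training is switched on, sketched assuming Lightning 2.x ("16-mixed" maps to native AMP, i.e. autocast plus a GradScaler; in 1.x the flag was precision=16):

```python
import lightning.pytorch as pl  # assuming the Lightning 2.x import path

# "16-mixed" enables native AMP on CUDA (autocast + GradScaler);
# in Lightning 1.x the equivalent is precision=16
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="16-mixed")
```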
I am wondering why the invalid gradients didn’t affect training at all…