@ptrblck Just checked the documentation of the GradScaler class and found this:
The scale factor often causes infs/NaNs to appear in gradients for the first few iterations as its value calibrates. scaler.step will skip the underlying optimizer.step() for these iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations).
Could this be the cause of such warnings?
And another question: do you get the scale factor using scaler.get_scale(), where scaler is an instance of torch.cuda.amp.GradScaler?
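
For context, here is a minimal sketch of how I would read the scale factor in a training loop (the model, optimizer, and dummy data are placeholders, and comparing the scale before and after scaler.update() to detect a skipped step is just my assumption about how to observe it):

```python
import torch

# Placeholder model, optimizer, and loss for illustration only.
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    # Dummy batch standing in for a real data loader.
    inputs = torch.randn(8, 10, device="cuda")
    targets = torch.randint(0, 2, (8,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)

    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()   # current scale factor
    scaler.step(optimizer)              # skips optimizer.step() if infs/NaNs were found
    scaler.update()
    # Assumption: if update() lowered the scale, the step was skipped this iteration.
    if scaler.get_scale() < scale_before:
        print(f"step {step}: optimizer.step() skipped, scale is now {scaler.get_scale()}")
```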