Also, what does “a few iterations” mean here? It’s expected that scaler.step(optimizer)
may skip the first few steps due to inf/nan gradients as the scale value calibrates, so the loss would not decrease for those iterations.
1 Like