Glad it’s working, and an interesting discovery. However, I don’t see the issue as solved yet. I think we can make it work better for your model immediately, and also help prevent this issue for future users.
In our experience `GradScaler`'s default constructor values rarely need to be tuned. Yours is the first case I'm aware of with the native API, and we tried it with 40ish models spanning many applications before merging. The intention was to supply default values (and a dynamic scale-finding heuristic) that are effective for the vast majority of networks, so `GradScaler`'s args don't become additional "hyperparameters."
The default `init_scale` is intended to be larger than the network initially needs. The large initial value causes inf/nan gradients for the first few iterations, but the scale quickly calibrates down to a successful value (because it's reduced by `backoff_factor` each time). After that, the large `growth_interval` means few iterations should be skipped, and the effect on performance is negligible.
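To make the calibration behavior concrete, here's a toy pure-Python simulation of the scale-update heuristic described above, using `GradScaler`'s documented defaults (`init_scale=65536`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`). The `largest_scale_that_fits` value is a made-up stand-in for whatever scale a particular network's gradients can tolerate without overflowing; it is not a real `GradScaler` parameter.

```python
# Toy simulation of GradScaler's dynamic scale heuristic. The "overflow"
# check below stands in for the real inf/nan detection on scaled gradients.
def simulate(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5,
             growth_interval=2000, largest_scale_that_fits=1024.0, steps=20):
    scale, since_growth, history = init_scale, 0, []
    for _ in range(steps):
        if scale > largest_scale_that_fits:
            # Skipped iteration: gradients overflowed, so back off and
            # reset the growth counter.
            scale *= backoff_factor
            since_growth = 0
        else:
            # Successful iteration: grow only every growth_interval steps.
            since_growth += 1
            if since_growth == growth_interval:
                scale *= growth_factor
                since_growth = 0
        history.append(scale)
    return history

history = simulate()
# The scale halves on the first few (skipped) iterations, then stabilizes:
print(history[:8])  # [32768.0, 16384.0, 8192.0, 4096.0, 2048.0, 1024.0, 1024.0, 1024.0]
```

With the defaults, only the first handful of iterations are skipped, after which the scale sits at a stable value for a long `growth_interval` stretch.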
In your case, it appears you're in the opposite situation: the default `init_scale` is smaller than you initially need. `growth_interval=10` is one way to increase the scale more quickly than it otherwise would, but once the value calibrates/stabilizes, roughly 1 out of 10 iterations will be skipped (a 10% training slowdown). You work around this by resetting `growth_interval` later, which is smart, but also inconvenient and not obvious. If all you need is a higher initial value, I'd construct `GradScaler(init_scale=<bigger value>)` instead of playing with `growth_interval` in multiple places.
Per the above paragraphs, the best practice is to supply an `init_scale` that's larger than your network needs, so the scale quickly calibrates down, then stabilizes. To do that, we need to figure out the value it calibrates to. Can you rerun your existing code (with `growth_interval=10`) and print `scaler.get_scale()` just after `scaler.update()` for the first few dozen steps to get a sense of the scale value it finds, and post the results here?
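A sketch of where that print would go, assuming a standard `torch.cuda.amp` training loop (names like `model`, `loader`, `optimizer`, and `loss_fn` stand in for your own objects; the function wrapper is just for illustration):

```python
import torch

def train_and_log(model, loader, optimizer, loss_fn, steps_to_log=50):
    """Run a standard amp loop, printing the scale right after update()."""
    scaler = torch.cuda.amp.GradScaler(growth_interval=10)
    scales = []
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        if step < steps_to_log:
            # get_scale() just after update() shows the calibrated value.
            scale = scaler.get_scale()
            scales.append(scale)
            print(f"step {step}: scale = {scale}")
    return scales
```

Note that on a machine without CUDA, `GradScaler` disables itself and `get_scale()` reports 1.0, so the logged values are only meaningful on a CUDA run.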
For you, the best `init_scale` would then be the next-greatest power of two* above the value it finds, and you can then ignore `growth_interval`. For me, the value it finds justifies a PR to increase the default `init_scale`, reducing the likelihood of this issue in the future. A larger initial scale value doesn't do much harm for any network (worst case, it causes a few more iterations at the beginning to be skipped).
(*Powers of two are best for `init_scale`, `growth_factor`, and `backoff_factor` because multiplication/division by a power of two is a bitwise-accurate operation on non-denormal IEEE floats.)
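Picking that next-greatest power of two is a one-liner; here's a small helper (my own, not part of PyTorch) that rounds a measured scale up to the power of two strictly above it:

```python
import math

def next_power_of_two(x):
    """Smallest power of two strictly greater than x (x > 0)."""
    return 2.0 ** (math.floor(math.log2(x)) + 1)

# If the scale stabilizes around 12000, a good init_scale would be 16384;
# if it lands exactly on 8192, go one power higher to 16384 as well.
print(next_power_of_two(12000.0))  # 16384.0
print(next_power_of_two(8192.0))   # 16384.0
```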