Glad it’s working, and an interesting discovery. However, I don’t see the issue as solved yet. I think we can make it work better for your model immediately, and also help prevent this issue for future users.

In our experience, `GradScaler`'s default constructor values rarely need to be tuned. Yours is the first case I'm aware of with the native API, and we tried it with ~40 models spanning many applications before merging. The intention was to supply default values (and a dynamic scale-finding heuristic) that are effective for the vast majority of networks, so `GradScaler`'s args don't become additional "hyperparameters."

The default `init_scale` is intended to be larger than the network initially needs. The large initial value causes inf/nan gradients for the first few iterations, but the scale quickly calibrates down to a successful value (because it's reduced by `backoff_factor` each time). After that, the large `growth_interval` means few iterations should be skipped, and the effect on performance is negligible.
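To make the backoff dynamic concrete, here's a toy simulation (not PyTorch source) of how a too-large `init_scale` calibrates down. The `overflow_threshold` is a hypothetical stand-in for the largest scale a given network's gradients tolerate before producing inf/nan:

```python
# Toy model of GradScaler's backoff heuristic: while the scale is too
# large, gradients overflow, the optimizer step is skipped, and the
# scale is multiplied by backoff_factor until it fits.
def calibrate(init_scale, overflow_threshold, backoff_factor=0.5):
    """Return (final_scale, skipped_steps) once the scale first fits."""
    scale, skipped = init_scale, 0
    while scale > overflow_threshold:  # inf/nan grads -> step skipped
        scale *= backoff_factor        # scaler backs off
        skipped += 1
    return scale, skipped

scale, skipped = calibrate(init_scale=65536.0, overflow_threshold=1024.0)
print(scale, skipped)  # 1024.0 after 6 skipped steps
```

Even starting 64x too high, only 6 early iterations are skipped, which is the "negligible effect" described above.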

In your case, it appears you're in the opposite situation: the default `init_scale` is smaller than you initially need. `growth_interval=10` is one way to increase the scale more quickly than it otherwise would, but once the value calibrates/stabilizes, roughly 1 out of 10 iterations will be skipped (a ~10% training slowdown). You work around this by resetting `growth_interval` later, which is smart, but also inconvenient and not obvious. If all you need is a higher initial value, I'd construct `GradScaler(init_scale=<bigger value>)` instead of playing with `growth_interval` in multiple places.
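The ~10% figure follows from a simple steady-state argument. This sketch models it under an assumption (again, not the real implementation, where overflow depends on the actual gradients): once the scale sits at the largest value the network tolerates (`max_scale`, hypothetical), every growth attempt overflows, gets skipped, and is backed off:

```python
# Steady-state model: after `growth_interval` successes the scale is
# doubled, immediately overflows, that step is skipped, and backoff
# restores the old scale -- about one skip per (growth_interval + 1) steps.
def steady_state_skip_rate(growth_interval, n_steps=1000,
                           growth_factor=2.0, backoff_factor=0.5,
                           max_scale=1024.0):
    scale, streak, skipped = max_scale, 0, 0
    for _ in range(n_steps):
        if scale > max_scale:           # gradients overflow: skip step
            scale *= backoff_factor
            streak = 0
            skipped += 1
        else:
            streak += 1
            if streak == growth_interval:
                scale *= growth_factor  # try a larger scale
                streak = 0
    return skipped / n_steps

print(steady_state_skip_rate(growth_interval=10))  # ~0.09, i.e. ~1 in 11
```

With the default `growth_interval=2000`, the same model gives a skip rate of ~0.05%, which is why the defaults cost essentially nothing once calibrated.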

Per the above paragraphs, the best practice is to supply an `init_scale` that's larger than your network needs, so the scale quickly calibrates down, then stabilizes. To do that, we need to figure out the value it calibrates to. **Can you rerun your existing code (with `growth_interval=10`), print `scaler.get_scale()` just after `scaler.update()` for the first few dozen steps to get a sense for the scale value it finds, and post the results here?**
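For concreteness, the requested logging looks like the loop below. To keep the snippet runnable without a GPU, it uses a minimal stand-in object exposing the same `update()`/`get_scale()` surface as `torch.cuda.amp.GradScaler`; in your script, use your real scaler and keep the rest of your training loop unchanged:

```python
class StubScaler:
    """Stand-in with GradScaler's update()/get_scale() interface.
    The real scaler adjusts its scale inside update(); this stub doesn't."""
    def __init__(self, init_scale=2.0 ** 16):
        self._scale = init_scale

    def update(self):
        pass  # real GradScaler recomputes the scale here

    def get_scale(self):
        return self._scale

scaler = StubScaler()
for step in range(36):  # "first few dozen steps"
    # scaler.scale(loss).backward(); scaler.step(optimizer)  # your real loop
    scaler.update()
    print(step, scaler.get_scale())
```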

For you, the best `init_scale` would then be the next-greatest power of two* above the value it finds, and you can then ignore `growth_interval`. For me, the value it finds justifies a PR to increase the default `init_scale`, reducing the likelihood of this issue in the future. A larger initial scale value doesn't do much harm for any network (worst case, it causes a few more iterations at the beginning to be skipped).
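Once you have the stabilized value, rounding it up is one line. `next_pow2` is a hypothetical helper (not part of PyTorch), shown with a made-up example value of 3000:

```python
import math

# Round a measured stable scale up to the next-greatest power of two,
# then pass it to the constructor: GradScaler(init_scale=next_pow2(found)).
def next_pow2(x):
    return 2.0 ** math.ceil(math.log2(x))

print(next_pow2(3000.0))  # 4096.0
```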

(*Powers of two are best for `init_scale`, `growth_factor`, and `backoff_factor` because multiplication/division by a power of two is a bitwise-accurate operation on non-denormal IEEE floats.)
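A quick way to see the bitwise-accuracy claim: scaling by a power of two only changes a float's exponent bits, so multiply-then-divide round-trips exactly, while a non-power-of-two factor can perturb the mantissa through rounding:

```python
# Multiplying and dividing by 2.0 is exact for non-denormal floats...
vals = [0.1, 1.0 / 3.0, 3.14159, 1e-30, 12345.678]
assert all((v * 2.0) / 2.0 == v for v in vals)

# ...whereas a non-power-of-two factor can introduce rounding error.
print((0.1 * 3.0) / 3.0 == 0.1)  # False
```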