Glad it’s working, and an interesting discovery. However, I don’t see the issue as solved yet. I think we can make it work better for your model immediately, and also help prevent this issue for future users.
In our experience, `GradScaler`'s default constructor values rarely need to be tuned. Yours is the first case I'm aware of with the native API, and we tried it with 40ish models spanning many applications before merging. The intention was to supply default values (and a dynamic scale-finding heuristic) that are effective for the vast majority of networks, so `GradScaler`'s args don't become additional "hyperparameters."
`init_scale` is intended to be larger than the network initially needs. The large initial value causes inf/NaN gradients for the first few iterations, but the scale quickly calibrates down to a successful value (because it's reduced by `backoff_factor` each time a step is skipped). After that, the large `growth_interval` means few iterations are skipped, and the effect on performance is negligible.
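To make the calibration behavior concrete, here's a minimal pure-Python sketch of the heuristic (not the actual torch implementation). The defaults match the documented ones (`init_scale=65536`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`); `max_finite_scale` is a made-up stand-in for the largest scale a hypothetical network tolerates before its gradients overflow:

```python
# Pure-Python sketch of GradScaler's dynamic scale heuristic
# (illustrative only; not the real torch implementation).

def simulate(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5,
             growth_interval=2000, max_finite_scale=1024.0, steps=20):
    """max_finite_scale stands in for the largest scale this hypothetical
    network tolerates before gradients overflow to inf."""
    scale = init_scale
    good_steps = 0
    history = []
    for _ in range(steps):
        overflow = scale > max_finite_scale  # inf/NaN gradients this step
        if overflow:
            scale *= backoff_factor          # step is skipped, back off
            good_steps = 0
        else:
            good_steps += 1
            if good_steps == growth_interval:
                scale *= growth_factor       # periodically try growing again
                good_steps = 0
        history.append(scale)
    return history

hist = simulate()
# Starts at 65536 and halves on each skipped step until it reaches a
# workable value, then stays there until growth_interval successful
# steps elapse.
print(hist[:8])
```

With a too-large `init_scale`, only the first handful of steps are skipped; after that the scale sits at the calibrated value essentially for free.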
In your case, it appears you're in the opposite situation: the default `init_scale` is smaller than you initially need. `growth_interval=10` is one way to increase the scale more quickly than it otherwise would, but once the value calibrates/stabilizes, roughly 1 out of 10 iterations will be skipped (a 10% training slowdown). You work around this by resetting `growth_interval` later, which is smart, but also inconvenient and not obvious. If all you need is a higher initial value, I'd construct `GradScaler(init_scale=<bigger value>)` instead of playing with `growth_interval` in multiple places.
Per the above paragraphs, the best practice is to supply an `init_scale` that's larger than your network needs, so the scale quickly calibrates down, then stabilizes. To do that, we need to figure out the value it calibrates to. Can you rerun your existing code (with `growth_interval=10`), print `scaler.get_scale()` just after `scaler.update()` for the first few dozen steps to get a sense of the scale value it finds, and post the results here?
For you, the best `init_scale` would then be the next-greatest power of two* above the value it finds, and you can then ignore `growth_interval`. For me, the value it finds justifies a PR to increase the default `init_scale`, reducing the likelihood of this issue in the future. A larger initial scale value doesn't do much harm for any network (worst case, it causes a few more iterations at the beginning to be skipped).
(*Powers of two are best for `backoff_factor` because multiplication/division by powers of two is a bitwise accurate operation on non-denormal IEEE floats.)
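For picking that value, a quick helper (the function name is just for illustration). Since the scale only ever moves by factors of `growth_factor`/`backoff_factor`, the stabilized value is itself a power of two, so "above" here means strictly greater:

```python
import math

def next_pow2_above(x: float) -> float:
    """Smallest power of two strictly greater than x.
    Illustrative helper, not part of any library API."""
    return float(2 ** (math.floor(math.log2(x)) + 1))

print(next_pow2_above(1024.0))   # 2048.0
print(next_pow2_above(40000.0))  # 65536.0
```

E.g. if the printed scale settles at 1024, `GradScaler(init_scale=2048)` would be a reasonable choice.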