GradScaler decreases the scale suddenly in AMP multi-GPU training

OS: Ubuntu 22.04
Python: 3.10.12
GPU: H800, CUDA 12.1
torch: 2.4.1+cu121
Question details:
This may be a bug in amp.GradScaler, but I am not sure.
I have a model that I train with AMP. With a relatively high learning rate (e.g. 1e-3) on a single GPU it runs fine, but when I train it on multiple GPUs I get None gradients.
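For reference, the multi-GPU run follows the standard DDP + autocast + GradScaler pattern, roughly like the sketch below (launched with torchrun; the model, optimizer, and data here are placeholders, not my real code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun; LOCAL_RANK is set by the launcher.
dist.init_process_group(backend="nccl")
device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
torch.cuda.set_device(device)

# Placeholder model/optimizer/data, just to show the structure of the loop.
model = DDP(nn.Linear(1024, 1024).to(device), device_ids=[device.index])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")  # defaults: init_scale=65536.0, backoff_factor=0.5

for step in range(1000):
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randn(32, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skips optimizer.step() when any grad has inf/nan
    scaler.update()         # halves the scale on inf/nan, otherwise grows it every 2000 clean steps
```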
I followed the scale value of the GradScaler (initialized with the defaults), and at some step it jumps from 131072.0 to about 5.29e-23 in a single step. According to the documentation, when a step produces Inf gradient values the GradScaler shrinks the scale to half, so going from 131072.0 down to such a tiny value should take many training steps (131072.0 is 2^17 and ~5.29e-23 is roughly 2^-74, i.e. around 91 consecutive halvings with the default backoff_factor=0.5). How can this change happen in just one step? Is this a bug, or am I misunderstanding something?
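This is roughly how I followed the scale, by logging scaler.get_scale() around scaler.update() (a small helper sketch standing in for the step/update lines of the loop above):

```python
def step_and_log_scale(scaler, optimizer, step):
    """Step the optimizer through the scaler and log any scale change."""
    prev_scale = scaler.get_scale()
    scaler.step(optimizer)   # skipped internally if grads contain inf/nan
    scaler.update()          # default backoff_factor=0.5, so one bad step should only halve the scale
    new_scale = scaler.get_scale()
    if new_scale != prev_scale:
        print(f"step {step}: scale changed {prev_scale} -> {new_scale}")
```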

Do you see the same issue using a newer PyTorch release, e.g. 2.7.1 or a nightly binary?