Is the cuda.amp.GradScaler capable of dealing with two differently scaled sets of gradients?

zhi_Li · May 17, 2020, 4:24am

I found this case in the "CUDA Automatic Mixed Precision examples" page of the PyTorch master documentation:

If your network has multiple losses, you must call scaler.scale on each of them individually.
…
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()

It seems that the gradients backpropagated from the two losses, each scaled by its own scaling factor, are simply accumulated. Can the scaler unscale the two sets of gradients with their corresponding scaling factors when the parameters are updated?

ptrblck · May 17, 2020, 6:58am

The losses are not accumulated, as the backward call is used on each of the losses separately.

You can unscale the gradients explicitly using:

# You can choose which optimizers receive explicit unscaling, if you
# want to inspect or modify the gradients of the params they own.
scaler.unscale_(optimizer0)

zhi_Li · May 17, 2020, 7:31am

Sorry, I worded my question poorly. I meant that the gradients backpropagated from the two losses seem to be simply accumulated. Can the scaler tell the two sets of gradients apart and unscale each of them properly for the parameter update?

ptrblck · May 17, 2020, 7:33am

Ah OK, thanks for the follow-up.
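Yes, the gradients are unscaled correctly. Both losses are multiplied by the same scale factor, since they go through the same scaler, so the gradients accumulated in .grad share that factor and are divided by it once: in scaler.unscale_(optimizer) or, if you skip that call, inside scaler.step(optimizer).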
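
To see why accumulating across the two backward calls is safe, here is a plain-Python sketch of the arithmetic only. It does not call PyTorch or reproduce GradScaler's internals, and it assumes one scaler is used for both losses, as in the snippet from the docs, so both share one scale factor within an iteration. All names (SCALE, grad_loss0, grad_loss1, backward_scaled, unscale) are hypothetical and chosen for illustration.

```python
# Arithmetic sketch only: plain Python lists stand in for param.grad.
# Assumption: both losses are scaled by the same factor in one iteration.

SCALE = 65536.0  # a power of two, so the float arithmetic below is exact

# Made-up gradients of loss0 and loss1 with respect to three parameters.
grad_loss0 = [0.5, -0.25, 2.0]
grad_loss1 = [1.5, 0.75, -1.0]


def backward_scaled(accumulated, grads, scale):
    """Mimic backward() on a scaled loss: add scale * grad to the buffer."""
    return [acc + scale * g for acc, g in zip(accumulated, grads)]


def unscale(accumulated, scale):
    """Mimic a single unscale pass: divide the whole buffer by scale."""
    return [acc / scale for acc in accumulated]


# Two backward passes accumulate into the same buffer.
buffer = [0.0, 0.0, 0.0]
buffer = backward_scaled(buffer, grad_loss0, SCALE)
buffer = backward_scaled(buffer, grad_loss1, SCALE)

# One division recovers the sum of the unscaled gradients.
unscaled = unscale(buffer, SCALE)
expected = [g0 + g1 for g0, g1 in zip(grad_loss0, grad_loss1)]
print(unscaled == expected)
```

Because S*g0 + S*g1 = S*(g0 + g1), the scaler has no need to tell the two sets of gradients apart. That would stop being true only if the losses were scaled by different factors, for example by using separate scalers.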

zhi_Li · May 17, 2020, 7:46am

Thank you for your kind help! This is a powerful feature.