If your network has multiple losses, you must call scaler.scale on each of them individually.
…
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()
It seems that the gradients backpropagated from the two losses, which are scaled by different scaling parameters, are simply accumulated. Can the scaler unscale the two sets of gradients with their corresponding scaling parameters when it comes to the parameter update?
The losses are not accumulated, as the backward call is used on each of the losses separately.
You can unscale the gradients explicitly using:
# You can choose which optimizers receive explicit unscaling, if you
# want to inspect or modify the gradients of the params they own.
scaler.unscale_(optimizer0)
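As a rough illustration of this pattern with two losses, here is a minimal sketch. It assumes a single GradScaler instance shared by both losses, in which case both are multiplied by the same scale factor, so the accumulated gradient is that factor times the sum of the two gradients and one unscale_ call recovers the plain sum. The names model, two_losses and reference are hypothetical and not from the thread. On a machine without CUDA the scaler is disabled and the check passes trivially.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Hypothetical toy setup, only for illustration.
model = torch.nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# One scaler shared by both losses; disabled when CUDA is unavailable.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 4, device=device)

def two_losses():
    # Two losses that share the same forward graph.
    out = model(x)
    return out[:, 0].pow(2).mean(), out[:, 1].abs().mean()

# Reference: unscaled gradient of the summed loss.
loss0, loss1 = two_losses()
reference = torch.autograd.grad(loss0 + loss1, list(model.parameters()))

# Scaled path: each loss is scaled individually, gradients accumulate in .grad.
optimizer.zero_grad()
loss0, loss1 = two_losses()
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()

# Explicit unscaling divides the accumulated gradients by the shared factor.
scaler.unscale_(optimizer)
grads_match = all(
    torch.allclose(p.grad, r, rtol=1e-4, atol=1e-6)
    for p, r in zip(model.parameters(), reference)
)

# step() sees that unscale_ was already called and does not unscale again.
scaler.step(optimizer)
scaler.update()
```

If the two losses were scaled by separate scalers with different factors, this check would not be expected to hold, since a single division cannot undo two different factors.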
Sorry, I worded the question poorly. I mean that the gradients backpropagated from the two losses seem to be simply accumulated. Can the scaler tell the two sets of gradients apart and unscale each of them properly for the parameter update?