Hi!
I have come across a mixed precision issue while testing the AdaHessian optimizer (https://arxiv.org/abs/2006.00719), a second-order optimizer, which I think arises from GradScaler.unscale_.
This colab working example shows the results of using unscale_ versus manually unscaling.
AdaHessian
A core element of AdaHessian is calculating the Hessian-vector products hvs, as below, after scaling up the loss and performing the backward pass. The gradients need to be unscaled before being passed to autograd.grad.
hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)
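For context, v is a list of Rademacher random vectors (entries of ±1) used for Hutchinson's approximation of the Hessian diagonal. A minimal sketch of how such vectors might be generated (illustrative, not the exact colab code):

# One Rademacher vector per parameter, matching shape/dtype/device.
v = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]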
Using GradScaler.unscale_ produces nan
However, despite manually verifying (by printing the min/mean of p.grad) that the gradients being passed to autograd.grad have been scaled down via scaler.unscale_(optimizer), it seems that autograd.grad doesn't "respond"(?) to this unscaling and subsequently produces nan or very high values for hvs.
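The check I ran looks roughly like this (a sketch; it assumes backward has already been called on the scaled loss):

scaler.unscale_(optimizer)  # GradScaler divides each p.grad in place
for p in optimizer.param_groups[0]['params']:
    if p.grad is not None:
        print(p.grad.min().item(), p.grad.mean().item())  # values look unscaled

The printed values look correctly unscaled, yet hvs still comes out as nan.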
Manual unscaling works
However, if, after gathering the gradients from the optimizer, I manually unscale them using the same scaling factor used by GradScaler, the calculation of hvs works as expected.
Pseudo code
(see the colab link for the full version)
# AUTOCAST
with autocast(enabled=fp16_enabled):
    output = net(input)
    loss = criterion(output, target)
# SCALE UP (create_graph=True so the gradients themselves stay in the graph)
scaler.scale(loss).backward(create_graph=True)
# SCALE DOWN
if not manual_unscale:
    scaler.unscale_(optimizer)  # UNSCALE GRADS WITH GRADSCALER
# GATHER GRADSH AND PARAMS
gradsH = []
params = optimizer.param_groups[0]['params']
for p in params:
    gradsH.append(0. if p.grad is None else p.grad + 0.)
# MANUALLY UNSCALE (if manual_unscale == True)
if manual_unscale:
    gradscaler_scale = scaler.get_scale()
    for g in gradsH:
        g.div_(gradscaler_scale / scale_reduction)
# CALCULATE hvs (v and scale_reduction are defined in the full colab version)
hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)
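Downstream, AdaHessian uses hvs for Hutchinson's estimate of the Hessian diagonal, roughly as in this sketch (the idea, not the exact colab code):

# For Rademacher v, E[v * (H @ v)] equals diag(H) elementwise,
# so v * hv gives a per-element estimate of the Hessian diagonal.
hessian_diag = [vi * hv for vi, hv in zip(v, hvs)]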
Question
Does using GradScaler.unscale_ to unscale the gradients mean that this unscaling operation is not "recorded"* in the graph, whereas gathering the gradients from the optimizer and manually unscaling them is "recorded"?
*(sorry, not sure what the correct terminology is here)
Is this behaviour expected, and is there a workaround for unscaling that involves GradScaler, or should I just stick with manually unscaling the gradients I have gathered?
Comment
(I know that if I manually unscale, that unscaling will not be recorded by GradScaler, and the optimizer's gradients will subsequently be unscaled in GradScaler.step. However, this should be OK, I think, since I will have manually unscaled a new list of gradients and the gradients in the optimizer will be untouched by the manual unscaling.)
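To illustrate that last point (a toy sketch, not the colab code): p.grad + 0. produces a new tensor, so dividing the copy in place never touches the gradient stored on the parameter.

import torch

p = torch.nn.Parameter(torch.ones(3))
p.grad = torch.full((3,), 6.0)  # pretend these are the scaled grads
g = p.grad + 0.                 # new tensor, not a view of p.grad
g.div_(2.0)                     # in-place divide affects the copy only
print(p.grad)                   # tensor([6., 6., 6.]) -- untouched
print(g)                        # tensor([3., 3., 3.])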
Thanks for reading this far, looking forward to hearing what anyone thinks!