GradScaler.unscale_, autograd.grad and second differentiation

Hi!

I have come across a mixed precision issue while testing the AdaHessian optimizer (https://arxiv.org/abs/2006.00719), a second-order optimizer, which I think arises out of GradScaler.unscale_.

This Colab working example shows the results of using unscale_ versus manually unscaling.

AdaHessian

A core element of AdaHessian is calculating hvs as below, after scaling up the loss and performing the backward pass. The gradients need to be unscaled before being passed to autograd.grad.

hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)

Using GradScaler.unscale_ produces nan

However, despite manually verifying (by printing the min/mean of each p.grad) that the gradients being passed to autograd.grad have been scaled down via scaler.unscale_(optimizer), it seems like autograd.grad doesn't "respond"(?) to this unscaling and subsequently produces nan or very high values for hvs.
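For reference, the check I mean looks roughly like this (a sketch, using the scaler and optimizer from the pseudo code below):

# sanity check (sketch): inspect gradient magnitudes after GradScaler unscaling
scaler.unscale_(optimizer)
for p in optimizer.param_groups[0]['params']:
    if p.grad is not None:
        print(p.grad.min().item(), p.grad.mean().item())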

Manual unscaling works

However, if, after gathering the gradients from the optimizer, I manually unscale them using the same scaling factor used by GradScaler, the calculation of hvs works as expected.

Pseudo code

(see colab link for full version)

# AUTOCAST
with autocast(enabled=fp16_enabled):   
    output = net(input)
    loss = criterion(output, target)
      
# SCALE UP
scaler.scale(loss).backward(create_graph=True)

# SCALE DOWN
if not manual_unscale:
    scaler.unscale_(optimizer)     # UNSCALE GRADS WITH GRADSCALER   

# GATHER GRADSH AND PARAMS
gradsH = []
params = optimizer.param_groups[0]['params']
for p in params:
    gradsH.append(0. if p.grad is None else p.grad + 0.)

# MANUALLY UNSCALE (if manual_unscale==True)
if manual_unscale:    
    gradscaler_scale = scaler.get_scale()
    for g in gradsH: 
        g.div_(gradscaler_scale/scale_reduction)

# CALCULATE hvs
hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)

Question

Does using GradScaler.unscale_ to unscale the gradients mean that this unscaling operation is not "recorded"* in the graph, whereas gathering the gradients from the optimizer and manually unscaling them is "recorded"?

*(sorry not sure what the correct terminology is here)

Is this behaviour expected, and is there a workaround that still uses GradScaler for the unscaling, or should I just stick with manually unscaling the gradients I have gathered?

Comment

(I know that if I manually unscale, that unscaling will not be recorded by GradScaler, and the gradients held by the optimizer will subsequently be unscaled in GradScaler.step. However, I think this should be OK, since I will have manually unscaled a new list of gradients and the gradients in the optimizer will be untouched by this manual unscaling.)
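For completeness, the end of the iteration I have in mind looks roughly like this (a sketch, assuming the pseudo code above with manual_unscale=True):

# gradsH are manually unscaled copies; the .grad attributes are still scaled
hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)

# scaler.step unscales the (still scaled) .grad attributes, runs the inf/nan
# check, and skips optimizer.step() if anything overflowed
scaler.step(optimizer)
scaler.update()   # adjusts the scale factor for the next iteration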

Thanks for reading this far, looking forward to hearing what anyone thinks!


How did you define scale_reduction as this seems to be the main difference between both approaches, no?

Oh sorry, I meant to remove that; it is just set manually for now so I can test different down-scaling values. scale_reduction=1 for now.

scaler.unscale_(optimizer) unscales the .grad attributes of all params owned by optimizer, after those .grads have been fully accumulated for those parameters this iteration and are about to be applied. If you intend to accumulate more gradients into .grads later in the iteration, scaler.unscale_ is premature.

Also, the unscale+inf/nan check kernel used by scaler.unscale_ is not autograd-aware.

In use cases I've seen, creating out-of-place gradients via torch.autograd.grad with create_graph=True is typically a setup for a double-backward that will accumulate gradients into the param.grad attributes. For these cases, manually unscaling the out-of-place grads before using them to set up the double-backward is expected: see the gradient penalty example.
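That pattern looks roughly like this (a sketch simplified from the gradient penalty recipe in the AMP examples, assuming model, loss_fn, optimizer and scaler are already set up):

with autocast():
    output = model(input)
    loss = loss_fn(output, target)

# create *scaled* out-of-place grads, then unscale them manually
scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                         inputs=model.parameters(),
                                         create_graph=True)
inv_scale = 1. / scaler.get_scale()
grad_params = [p * inv_scale for p in scaled_grad_params]

# build the penalty term from the unscaled grads and do the double backward
with autocast():
    grad_norm = torch.stack([g.pow(2).sum() for g in grad_params]).sum().sqrt()
    loss = loss + grad_norm

scaler.scale(loss).backward()   # accumulates (scaled) grads into param.grad
scaler.step(optimizer)          # unscale + inf/nan check happen here
scaler.update()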

In your snippet the control flow looks strange to me:

  1. first you run backward(create_graph=True) that accumulates gradients into param.grads.
  2. then you collect gradsH.append(0. if p.grad is None else p.grad + 0.); presumably the p.grad + 0. ensures gradsH are separate tensors from param.grads (but if so, why not also use an out-of-place torch.autograd.grad to create gradsH? see the sketch after this list)
  3. Then you use those .grads as torch.autograd.grad graph roots to create hvs out of place.
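For example, something along these lines (a sketch only, reusing the names from your snippet):

# sketch: create gradsH out of place instead of reading the .grad attributes
with autocast(enabled=fp16_enabled):
    output = net(input)
    loss = criterion(output, target)

scaled_gradsH = torch.autograd.grad(scaler.scale(loss), params, create_graph=True)
inv_scale = 1. / scaler.get_scale()
gradsH = [g * inv_scale for g in scaled_gradsH]

# the param.grad attributes are untouched here; since create_graph=True retains
# the graph, a following scaler.scale(loss).backward() + scaler.step(optimizer)
# can still populate and unscale them for the optimizer update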

Your optimizer is vanilla SGD and you don't appear to use hvs, so it's hard for me to recommend a way to create it.

FYI GradScaler also has an enabled argument, i.e.

if fp16_enabled:
    scaler.scale(loss).backward(create_graph=True)
else:
    loss.backward(create_graph=True)

is equivalent to

scaler = GradScaler(init_scale=init_scale, enabled=fp16_enabled)
...
scaler.scale(loss).backward(create_graph=True)

Thank you for all this, it really helps a lot!

Good to know, thanks

I tested this today and couldn't get it to work for some reason; collecting gradsH and params from the optimizer seemed to be the only thing that worked (i.e. it trained with no errors, but results were poorer).

Missed that, thanks

Oh yep, that was just some bare-bones code to be able to run a single step. hvs is used to calculate the Hutchinson trace, which is subsequently used in a modification of AdamW:

v = [torch.randint_like(p, high=2, device=device) for p in params]   # 0/1 entries

for v_i in v: v_i[v_i == 0] = -1   # map to Rademacher (+1/-1) vectors
        
hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, create_graph=False, retain_graph=False)

hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 1:   # scalar/1D params; branches for higher-dim params omitted here
        tmp_output = torch.abs(hv * vi)
        hutchinson_trace.append(tmp_output)

hutchinson_trace then replaces the gradient in the denominator when calculating the AdamW step.
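Roughly like this (a sketch of the idea only, not the exact AdaHessian code; exp_avg, exp_hessian_diag_sq, bias_correction1/2 etc. are the usual Adam state names):

# sketch: the Hutchinson trace estimate replaces grad**2 in the second moment,
# i.e. in the denominator of an AdamW-style update
p.data.mul_(1 - lr * weight_decay)                       # decoupled weight decay
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)          # first moment from the gradient
exp_hessian_diag_sq.mul_(beta2).addcmul_(hut_trace, hut_trace, value=1 - beta2)

denom = (exp_hessian_diag_sq / bias_correction2).sqrt_().add_(eps)
p.data.addcdiv_(exp_avg, denom, value=-lr / bias_correction1)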

BatchNorm had to be set to FP32

After implementing manual unscaling, I got native AMP to work with my ResNet18; however, I had to modify the input to my BatchNorm to explicitly be FP32 for it to work with larger input sizes (it worked for toy-size problems without modifying the input to BatchNorm).
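The kind of change I mean looks roughly like this (a sketch only; the wrapper class is just illustrative, not my exact code):

import torch.nn as nn

class BatchNorm2dFP32(nn.BatchNorm2d):
    # sketch: force the BatchNorm input to FP32, even when called under autocast
    def forward(self, x):
        return super().forward(x.float())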

Have you heard of others with this BN issue?
I see the autocast docs don't explicitly define whether BN should be FP16 or FP32. Maybe it just wasn't working because of my double-backward case…

Memory usage - autocast debugging

Despite successful training, when checking memory use via 'nvidia-smi' or 'nvtop' it doesn't seem like the GPU memory usage has decreased. It's about the same as without native AMP, and in fact after about 30 steps GPU memory usage increases by approx. 800 MB. I have identified that the increase happens after the second backward below is called for the 30th time.

hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)

Note that 800 MB also appears to be the increase in memory when hvs is calculated for the very first time in the very first step, so maybe a copy is being made?
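To narrow it down I will probably compare the allocator stats around that call, something like this (a sketch; nvidia-smi only shows what the caching allocator has reserved, so torch.cuda.memory_allocated should be a better measure of what the tensors actually use):

# sketch: measure memory allocated by the second backward directly
torch.cuda.synchronize()
before = torch.cuda.memory_allocated()

hvs = torch.autograd.grad(gradsH, params, grad_outputs=v, only_inputs=True, retain_graph=False)

torch.cuda.synchronize()
print(f"autograd.grad allocated {(torch.cuda.memory_allocated() - before) / 2**20:.1f} MiB")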

What might be the best way to check what precision different operations are run in under autocast?
Checking the dtype of the inputs/outputs of each layer?
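e.g. something like forward hooks that log the output dtype of each module (a sketch, reusing net, input and fp16_enabled from above):

# sketch: print each module's output dtype while running under autocast
def log_dtype(name):
    def hook(module, inputs, output):
        if torch.is_tensor(output):
            print(f"{name} ({type(module).__name__}): {output.dtype}")
    return hook

handles = [m.register_forward_hook(log_dtype(n)) for n, m in net.named_modules()]

with autocast(enabled=fp16_enabled):
    net(input)

for h in handles:
    h.remove()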

Is there anything that you are aware of that would cause torch.autograd.grad memory usage to increase?
I checked the size of the tensors being passed in, as below, but the inputs stay exactly the same size from start to finish (as expected). The size of the output also remains the same.

gsize += g.element_size() * g.nelement()   # summed over each input tensor g

Thanks again for your detailed reply above!