Computing gradients with Apex

I want to compute gradients in the standard manner:

grad = torch.autograd.grad(loss, model.parameters())

I want to use AMP. To get the scaled loss, the Apex API suggests:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

But I don’t want to call loss.backward() directly; I want to compute the gradients in the manner shown above.
I tried this:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    grad = torch.autograd.grad(scaled_loss, model.parameters())

This makes the loss NaN. How can I compute gradients with Apex?

We recommend using the native automatic mixed-precision training via a master build or the nightly binaries.
For your use case, you might have forgotten to unscale the gradients, or you might have manually updated the parameters with invalid gradients.

Have a look at the gradient penalty example for an example use case of autograd.grad.
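
The relevant pattern from that example looks roughly like this (a minimal sketch assuming a torch.cuda.amp.GradScaler named scaler and placeholder model, data, target, and loss_fn objects):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    output = model(data)             # placeholder forward pass
    loss = loss_fn(output, target)

# take gradients of the scaled loss, then unscale them manually
scaled_grads = torch.autograd.grad(scaler.scale(loss), model.parameters())
inv_scale = 1. / scaler.get_scale()
grads = [g * inv_scale for g in scaled_grads]

The unscaled grads can then be used in place of the gradients you would get from autograd.grad in the fp32 run.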

Thanks for your reply. I did what you said, but before I can test your suggestion, I’m facing another issue.
Without fp16, I’m able to train with batch size 128.
After amp.initialize() is called, I get an OOM error with the same batch size, which is not ideal since memory usage should decrease. A similar issue has been raised here.

The linked issue initialized models in a loop, which is not supported in apex/amp.
Could you post a reproducible code snippet using native amp, so that we could have a look?

Regarding the first issue, as per your suggestion, I modified my code as follows:

if self.args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_grad_params = torch.autograd.grad(self.scaler.scale(scaled_loss), params, allow_unused=True)

Before I can multiply the scaled gradients by inv_scale, it throws an error because scaled_grad_params is None.
I assume amp.scale_loss is not supposed to be used like this. I searched the documentation, and this method is only used with loss.backward(), which I’m not calling. Can you please suggest what I am missing?
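
For reference, the unscale step I’m trying to reach afterwards looks roughly like this (assuming self.scaler is the GradScaler from above and scaled_grad_params comes back as a tuple whose unused entries may be None because of allow_unused=True):

# unscale the gradients returned by autograd.grad
inv_scale = 1. / self.scaler.get_scale()
grad_params = [g * inv_scale if g is not None else None for g in scaled_grad_params]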

Is your code working without amp, i.e. are you getting valid gradients from autograd.grad?

Yes, my code works fine without AMP.
I compute grads:

grads = torch.autograd.grad(loss, model.parameters())

Then update my model’s parameters:

for param in model.parameters():  
    param.data = ....  # (these params come from grads after some operation)
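
As a hypothetical stand-in for that operation, the update has the same shape as a plain SGD step (assuming a learning rate lr; my actual update is different):

with torch.no_grad():
    for param, grad in zip(model.parameters(), grads):
        param -= lr * grad  # hypothetical example update using the computed grads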

I don’t know why the output is apparently empty when you are using amp, so we would need to debug it.
Could you post a minimal, executable code snippet so that we could have a look?