In the step function of GradScaler, if a closure is passed in kwargs, a RuntimeError is raised by the following code.
The error message says “not currently supported”. Are there any plans to support this feature? Or could you tell me the reason why closures aren’t supported? If possible, I’d like to try to write a patch to solve this issue.
My motivation is to support SAM with native amp easily.
Making closures work with dynamic gradient scaling (specifically, the fact that dynamic gradient scaling occasionally skips optimizer.step() if any grads were inf/nan) is tricky, and we haven’t heard any use cases that absolutely needed it (LBFGS is the only one I’m aware of and no one’s asked for that).
Can you implement SAM without a closure like this?
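For reference, the non-closure usage from davda54’s repo looks roughly like the sketch below. This is only a sketch: model, criterion, and dataset are placeholders, and SAM here refers to that repository’s implementation, which wraps a base optimizer and exposes first_step()/second_step().

import torch
from sam import SAM  # davda54's implementation; adjust the import to your project layout

# model, criterion, and dataset are assumed to exist.
base_optimizer = torch.optim.SGD  # any base optimizer class works
optimizer = SAM(model.parameters(), base_optimizer, lr=0.1, momentum=0.9)

for input, label in dataset:
    # First forward/backward pass: gradients at the current weights.
    criterion(model(input), label).backward()
    optimizer.first_step(zero_grad=True)  # perturb the weights toward the "sharp" direction

    # Second forward/backward pass: gradients at the perturbed weights.
    criterion(model(input), label).backward()
    optimizer.second_step(zero_grad=True)  # restore the weights and apply the base optimizer update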
Thank you for your reply!
As you pointed out, I use davda54’s SAM implementation without a closure, and it works well.
But I use pytorch-lightning, and in that environment I have to write some extra code to compute the gradients twice. My motivation is to avoid that; I want to use SAM like any other optimizer.
I understand that closure support is tricky, so I’ve decided to use SAM without a closure. Thanks!!
Hi, I have the same problem you did. How did you get fp16 support working without a closure using the linked GitHub repo? It seems to me that we have to call scaler.step(optimizer), which in turn calls optimizer.step() with no closure argument. To use the library we need to call first_step() and second_step(), so I don’t see how it’s possible to use this implementation with float16 support.
Hi Alex, I ran into the same problem recently and figured out a way that uses vanilla torch. You still run scaler.scale(loss).backward(), but instead of calling scaler.step(optimizer), you unscale the gradients with scaler.unscale_(optimizer), check that they are finite, and only then call optimizer.first_step() followed by scaler.update(). You then repeat the same sequence for the second step. Here’s an example:
import torch

# model, criterion, dataset, the SAM optimizer, and the GradScaler (`scaler`)
# are assumed to be set up already.

def are_grads_finite(params):
    # Gather all existing grads into one flat vector and check for inf/nan.
    grads = [p.grad for p in params if p.grad is not None]
    vec = torch.nn.utils.parameters_to_vector(grads)
    return bool(torch.isfinite(vec).all())

for input, label in dataset:
    # First pass
    with torch.cuda.amp.autocast():
        output = model(input)
        loss = criterion(output, label)
    scaler.scale(loss).backward()

    scaler.unscale_(optimizer)
    if not are_grads_finite(model.parameters()):
        optimizer.zero_grad()
        scaler.update()
        continue  # skip the batch if grads are non-finite, like scaler.step() does
    optimizer.first_step(zero_grad=True)
    scaler.update()

    # Second pass
    with torch.cuda.amp.autocast():
        output = model(input)
        loss = criterion(output, label)
    scaler.scale(loss).backward()

    scaler.unscale_(optimizer)
    if not are_grads_finite(model.parameters()):
        optimizer.zero_grad()
        scaler.update()
        continue
    optimizer.second_step(zero_grad=True)
    scaler.update()
Basically, you need to do manually what scaler.step(optimizer) does, since you can’t tell it to run optimizer.first_step() or second_step(). I tested this while training some classification models, and they converged without issue.
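If you want to avoid duplicating the unscale/check/step/update logic for both passes, one option (an untested sketch under the same assumptions as the loop above, reusing the are_grads_finite helper) is to factor it into a small function:

def scaled_sam_step(scaler, optimizer, params, step_fn):
    # Roughly mirrors what scaler.step(optimizer) does: unscale, skip the step on
    # inf/nan grads, otherwise take the step, then update the scale.
    scaler.unscale_(optimizer)
    if not are_grads_finite(params):
        optimizer.zero_grad()
        scaler.update()  # shrinks the scale, like a skipped scaler.step()
        return False
    step_fn(zero_grad=True)
    scaler.update()
    return True

# Usage inside the training loop, after scaler.scale(loss).backward():
#   if not scaled_sam_step(scaler, optimizer, list(model.parameters()), optimizer.first_step):
#       continue  # skip the rest of the batch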