Differentiating With Respect to Learning Rate

I’m trying to compute the derivative of the error with respect to the learning rate (see below). I set up the learning rate as a tensor with gradient tracking enabled and then use it to update the parameters. My understanding is that setting create_graph=True allows higher-order derivatives to be taken, but when I ask autograd for the derivative of the error with respect to the learning rate I get the RuntimeError “One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.” The learning rate is used in the parameter update, and setting allow_unused=True just returns None. Is there a workaround for differentiating the error with respect to hyperparameters like the learning rate, or am I setting something up wrong?

I know this may be slow or not the intended usage, but is there any way to compute a gradient with respect to the learning rate in PyTorch, other than resorting to numeric derivatives? I thought maybe there would be a way to apply the update to param through a copy of the model, but I don’t know how to get autograd to track it.

import torch
import torch.nn as nn

model = nn.Linear(2, 2, bias=True)

x = torch.rand(10, 2)
y = x + 0.6

lr = nn.Parameter(torch.tensor(0.01, requires_grad=True))

func_loss = torch.nn.MSELoss()
err = func_loss(model(x), y)
grad = torch.autograd.grad(err, model.parameters(), create_graph=True)

for param, g in zip(model.parameters(), grad):
    # an in-place update (param -= lr * g) raises:
    # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
    param = param - lr * g

err = func_loss(model(x), y)
# raises: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph
print(torch.autograd.grad(err, lr, create_graph=True))

To avoid the “a leaf Variable that requires grad is being used in an in-place operation” error, you’ll need to do the operation out of place (which it looks like you are doing).
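
For reference, here’s a minimal illustration of that distinction on a bare leaf tensor (the names here are just for the demo):

import torch

w = torch.ones(3, requires_grad=True)  # a leaf tensor that requires grad
g = torch.ones(3)                      # stand-in for a gradient

# w -= 0.01 * g    # in-place update on a leaf: raises the RuntimeError above
w2 = w - 0.01 * g  # out-of-place update: allowed, and w2 stays in the graph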

param = param - lr * g doesn’t do what you think it does, though. You aren’t actually swapping out the model’s parameters here; you’re just rebinding the local loop variable, so the second loss is computed with the original, unchanged parameters.

You can use torch.func.functional_call so that instead of needing to swap out the params, you can explicitly pass new ones in.
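
For example, something like this (the new_params dict below is just illustrative, cloning the existing parameters unchanged):

import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Linear(2, 2)
x = torch.rand(10, 2)

# a dict mapping parameter names to replacement tensors
new_params = {name: p.clone() for name, p in model.named_parameters()}

# run the model's forward pass with new_params instead of its registered parameters
out = functional_call(model, new_params, (x,))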

Thank you. What I’m looking to do is take the derivative with respect to the learning rate at a single training step, to understand how changing the learning rate affects the error. I understand that you can’t update the actual parameters in place the way I’m doing while gradient tracking is on. Is there some way to create a structure where you can differentiate the error with respect to the learning rate?

Maybe there’s some way to get the parameters out of the model and load them back in while somehow keeping the gradient information? Is that the best way to do this? I assume I’d use functional_call.

Yeah, that’s right: functional_call should allow you to compute your updated parameters out of place and then pass the new parameters directly back into your model.
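
Putting it together, here’s a minimal sketch of one differentiable SGD step, reusing the names from your snippet (the update is done out of place, so the graph from lr to the new error stays intact):

import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Linear(2, 2, bias=True)
x = torch.rand(10, 2)
y = x + 0.6

lr = torch.tensor(0.01, requires_grad=True)  # learning rate as a differentiable tensor
func_loss = nn.MSELoss()

# inner loss and parameter gradients; create_graph keeps the update differentiable
err = func_loss(model(x), y)
params = dict(model.named_parameters())
grads = torch.autograd.grad(err, tuple(params.values()), create_graph=True)

# out-of-place SGD step: each new parameter depends on lr through the graph
new_params = {name: p - lr * g for (name, p), g in zip(params.items(), grads)}

# evaluate the model with the updated parameters without mutating the module
err_after = func_loss(functional_call(model, new_params, (x,)), y)

# d(err_after)/d(lr) is now well-defined
print(torch.autograd.grad(err_after, lr))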