Backpropagating through multiple optimizer steps

How can I set up a model/optimizer so that I can take multiple optimizer steps on a training loss, and then compute the gradient of my final training loss with respect to the starting parameters or with respect to per-sample weights on my training data?

When I try to set something like this up, I can’t get any gradients using the model/optimizer abstractions suggested in the PyTorch tutorials. The optimizer failing makes sense: it updates the parameters in place, so the computation graph presumably wouldn’t record those changes. What’s unclear to me is why I still can’t solve this even when I loop through my model’s parameters and apply gradient descent manually. Thanks!
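For concreteness, here is roughly what my manual attempt looks like (a hypothetical sketch with plain SGD standing in for my real setup):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))

params = list(model.parameters())
lr = 0.1
for _ in range(3):
    loss = criterion(model(x), target)
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        # the in-place update is invisible to autograd, so later losses
        # have no path back to the starting parameter values
        for p, g in zip(params, grads):
            p -= lr * g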

I’m not sure I understand the issue, so please correct me if I’m wrong.
As far as I understand, you would like to pass multiple batches through your model, calculate the loss for each, and use the accumulated loss to compute the gradients?

If so, you could do exactly that, or alternatively call loss.backward() in each iteration, since this will also accumulate the gradients:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(1, 10)
target = torch.empty(1, dtype=torch.long).random_(2)

# approach 1: accumulate the losses and call backward once
losses = 0
for _ in range(10):
    output = model(x)
    loss = criterion(output, target)
    losses += loss
print(losses)
losses.backward()
print(model.weight.grad)

# approach 2: call backward in each iteration; the gradients
# accumulate in the .grad attributes
model.zero_grad()
for _ in range(10):
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
print(model.weight.grad)
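Both versions produce the same gradients; the first keeps all ten graphs alive until the single backward() call, while the second frees each graph right after its backward().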

Here is also a good explanation of the different approaches.

Unfortunately, no, that’s not what I’m looking to do. Using your example as a start:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters())
x = torch.randn(10, 10)  # batch of 10, so each sample gets its own weight
target = torch.empty(10, dtype=torch.long).random_(2)
losses = 0
weights = torch.ones(10, requires_grad=True)  # per-sample weights on the training data
for _ in range(10):
    output = model(x)
    # nn.CrossEntropyLoss takes reduction at construction time, so use the
    # functional form here to get the per-sample losses
    loss = F.cross_entropy(output, target, reduction='none')
    w_loss = (loss * weights).sum() / weights.sum()
    losses += w_loss
print(losses)
losses.backward()
opt.step()

weights.grad.zero_()
model.zero_grad()
for _ in range(10):
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
# now I would like the gradient with respect to weights (but I find that it is 0)
print(weights.grad)

Before your second loop you zero the gradients of weights, and weights isn’t used anywhere after that.
How should its gradient be calculated? Do you want to keep the last gradient of weights and add it to the new one?

So, if I were able to backpropagate through the gradient step taken between the two loops, the loss in the second loop would depend on weights, because that step was computed from the weighted loss.

The computation graph is created in each new forward pass. Because weights isn’t used in your second loop, the computation graph won’t include it at all.
If you want a gradient with respect to weights, it has to be part of the computation that produces the loss. The in-place opt.step() also breaks that path, so you would need to apply the parameter update out of place and take the inner gradients with create_graph=True.
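As a rough sketch of what that could look like (a minimal example assuming a single plain SGD step in place of Adam, with the parameters held as plain tensors so the update can be applied out of place):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(2, 10, requires_grad=True)    # model parameters as plain tensors
b = torch.zeros(2, requires_grad=True)
weights = torch.ones(10, requires_grad=True)  # per-sample weights
x = torch.randn(10, 10)
target = torch.randint(0, 2, (10,))

# inner step: weighted loss, with create_graph=True so the gradients
# themselves stay differentiable
loss = F.cross_entropy(F.linear(x, W, b), target, reduction='none')
w_loss = (loss * weights).sum() / weights.sum()
gW, gb = torch.autograd.grad(w_loss, (W, b), create_graph=True)
W1, b1 = W - 0.1 * gW, b - 0.1 * gb  # out-of-place update keeps the graph

# outer loss uses the updated parameters; its graph reaches back
# through gW and gb to weights
outer_loss = F.cross_entropy(F.linear(x, W1, b1), target)
outer_loss.backward()
print(weights.grad)  # no longer zero

Libraries such as higher automate this pattern for full nn.Modules and stateful optimizers like Adam.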