As part of my model, I want to extract the gradients of my model's output with respect to its input, so that my model looks something like this:
class Model(nn.Module):
    ...

    def forward(self, input_data):
        output = MLP(input_data)
        gradients = grad(output, input_data, grad_outputs=torch.ones_like(output))
        return output, gradients
My training loop typically looks like this:
for epoch in range(num_epochs):
    for data_sample in dataloader:
        input_data = data_sample[0]
        target = data_sample[1]
        batch_size = len(target)
        target = target.reshape(batch_size, 1)
        input_data = input_data.to(device)
        target = target.to(device)

        def closure():
            optimizer.zero_grad()
            output, gradients = model(input_data)
            loss = criterion(output, target)
            loss.backward()
            return loss

        loss = optimizer.step(closure)
The idea is then to use the gradients as part of the loss function for my application. My issue is that when I run this, I get a "Trying to backward through the graph a second time, but the buffers have already been freed" error. When I comment out the line containing grad() in my model, training works normally. I read that intermediate buffers are freed when .backward() is called, but I'm not sure how to get around this. I tried setting the retain_graph argument of grad() to True, but then I get a different error: "element 0 of tensors does not require grad and does not have a grad_fn". Any help is appreciated. Thanks!
Here’s how I implemented a penalty on gradient growth; perhaps this will be useful to you:
optimizer.zero_grad()
loss.backward(retain_graph=True)

grads = torch.autograd.grad(loss, inputs, create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad
# attributes; it returns them as a tuple of tensors instead.
grad_sum = 0
for grad in grads:
    grad_sum += grad.pow(2).sum()

# grad_sum.backward() will accumulate the gradients into the .grad attributes
grad_sum.backward(retain_graph=True)
optimizer.step()
In my case, I used layer activations instead of the inputs, but it should work similarly for model inputs.

EDIT: you don’t need retain_graph=True in the grad_sum.backward() call if you don’t intend to do anything else with these gradients before the weight update.
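For your case (gradients of the output w.r.t. the input, used inside the loss), a minimal toy sketch might look like the following. The network sizes, the MSE loss, and the gradient term in the loss are just placeholder assumptions on my part; the key point is create_graph=True, which keeps the gradient computation differentiable so that loss.backward() works:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder architecture; substitute your own MLP.
        self.mlp = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, input_data):
        input_data.requires_grad_(True)
        output = self.mlp(input_data)
        # grad_outputs=ones is needed because output is not a scalar;
        # create_graph=True makes the returned gradients differentiable.
        gradients, = torch.autograd.grad(
            output, input_data,
            grad_outputs=torch.ones_like(output),
            create_graph=True,
        )
        return output, gradients

model = Model()
x = torch.randn(8, 3)
target = torch.zeros(8, 1)

output, gradients = model(x)
# The input gradients can now appear in the loss (here: a made-up penalty).
loss = nn.functional.mse_loss(output, target) + gradients.pow(2).mean()
loss.backward()  # no "backward through the graph a second time" error
```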
Okay, I seem to have gotten it to work; later lines in my code were conflicting with this portion. I needed to include create_graph=True in my grad() call. I guess I don’t fully understand how create_graph and retain_graph work. What I’m trying to do is, at every epoch, compute the gradient of my prediction w.r.t. the input along with the prediction itself, and then use it as part of my loss function. How does create_graph=True fit into all this? Thanks for your help!

Edit - I also noticed that it works if I instead set retain_graph=True. So it really is a question of which one to use and what their differences are, especially when it comes to speed.
My understanding is that create_graph constructs the graph for the backward pass itself. This graph is needed when you want to calculate gradients of gradients, as I do in my example: by calling grad_sum.backward(), I calculate the gradients of the loss with respect to grads.

The retain_graph flag preserves the backward graph in case you need to use it again for some reason; in my case, I also wanted to penalize the growth of the second-order gradients (the diagonal of the Hessian).