Grad() and loss.backward() conflict

As part of my model, I want to extract the gradients of my model output with respect to its input, so my model looks something like this:

class Model(nn.Module):
    .
    .
    .
    def forward(self, input_data):
        output = MLP(input_data)
        gradients = grad(output, input_data, grad_outputs=torch.ones_like(output))
        return output, gradients

My training loop typically looks like this:

for epoch in range(num_epochs):
    for data_sample in dataloader:
        input_data = data_sample[0]
        target = data_sample[1]
        batch_size = len(target)
        target = target.reshape(400, 1)

        input_data = input_data.to(device)
        target = target.to(device)

        def closure():
            optimizer.zero_grad()
            output, gradients = model(input_data)
            loss = criterion(output, target)
            loss.backward()
            return loss

        loss = optimizer.step(closure)

The idea is to then use the gradients as part of the loss function for my application. My issue is that when I run this, I get a “Trying to backward through the graph a second time, but the buffers have already been freed” error. If I remove the line containing grad() from my model, training works normally. I read something along the lines that the graph is freed when .backward() is called, but I’m not quite sure how to get around this. I tried setting the retain_graph argument in grad() to True, but then I get an error saying “element 0 of tensors does not require grad and does not have a grad_fn”. Any help is appreciated. Thanks!

Here’s how I implemented a penalty on gradient growth; perhaps it will be useful to you:

optimizer.zero_grad()
loss.backward(retain_graph=True)
grads = torch.autograd.grad(loss, inputs, create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes;
# it instead returns the gradients as a tuple of tensors.
grad_sum = 0
for grad in grads:
    grad_sum += grad.pow(2).sum()
# grad_sum.backward() will accumulate the gradients into the .grad attributes
grad_sum.backward(retain_graph=True)
optimizer.step()

In my case I used layer activations instead of the inputs, but it should work similarly for model inputs (see the sketch after the edit below).

EDIT: you don’t need retain_graph=True in the grad_sum.backward() call if you don’t intend to do anything else with these gradients before the weight update.
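For model inputs, the main extra step is that the inputs themselves must require grad before the forward pass. Here is a rough, self-contained sketch of the same penalty applied to inputs; the tiny model, data, and optimizer are made up for illustration and are not your exact setup:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Inputs must require grad, otherwise autograd.grad raises the
# "element 0 of tensors does not require grad" error.
inputs = torch.rand(8, 2, requires_grad=True)
targets = torch.rand(8, 1)

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward(retain_graph=True)   # keep the graph alive for the grad() call below

# create_graph=True makes the returned gradients differentiable, so the
# penalty built from them can itself be backpropagated into the weights.
grads = torch.autograd.grad(loss, inputs, create_graph=True)

grad_sum = sum(g.pow(2).sum() for g in grads)
grad_sum.backward()                # accumulates the penalty gradients into .grad
optimizer.step()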


Okay, I seem to have got it to work; later lines in my code were conflicting with this portion. I needed to include create_graph=True in my grad() call. I guess I don’t fully understand how create_graph and retain_graph work. What I’m trying to do is, at every epoch, calculate the gradient of my prediction w.r.t. the input along with the prediction itself, and then use that gradient as part of my loss function. How does including create_graph=True fit into all this? Thanks for your help!
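For reference, here is a minimal, self-contained sketch of the pattern I ended up with; the layer sizes, criterion, and the penalty weight lambda_grad are just placeholders for illustration:

import torch
import torch.nn as nn
from torch.autograd import grad

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, input_data):
        output = self.mlp(input_data)
        # create_graph=True keeps the gradient computation in the graph, so a
        # loss built from `gradients` can be backpropagated to the weights.
        # grad() returns a tuple, hence the unpacking.
        gradients, = grad(output, input_data,
                          grad_outputs=torch.ones_like(output),
                          create_graph=True)
        return output, gradients

model = Model()
criterion = nn.MSELoss()
optimizer = torch.optim.LBFGS(model.parameters())
lambda_grad = 0.1

input_data = torch.rand(400, 2, requires_grad=True)   # input must require grad
target = torch.rand(400, 1)

def closure():
    optimizer.zero_grad()
    output, gradients = model(input_data)
    loss = criterion(output, target) + lambda_grad * gradients.pow(2).mean()
    loss.backward()
    return loss

loss = optimizer.step(closure)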

Edit: I also noticed that if I set retain_graph=True instead, it works as well. So it really is a question of which one to use and what their differences are, especially when it comes to speed.

My understanding is that create_graph constructs the graph of the backward pass itself. That graph is needed when you want to calculate gradients of gradients, as I do in my example: by calling grad_sum.backward() I calculate the gradients of a loss built from the grads.

The flag retain_graph keeps this backward graph alive in case you need to use it again for some reason. For example, in my case I also wanted to penalize the growth of the second-order gradients (the diagonal of the Hessian).
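A toy example of the difference; the function x**3 and the value 3.0 are arbitrary:

import torch

x = torch.tensor([3.0], requires_grad=True)
y = (x ** 3).sum()

# retain_graph=True only keeps the forward graph alive so it can be
# backpropagated through again; the returned first-order gradient is
# not itself differentiable.
g_retain, = torch.autograd.grad(y, x, retain_graph=True)
print(g_retain.requires_grad)   # False

# create_graph=True additionally records the operations of the backward
# pass, so the gradient can be differentiated again (gradients of gradients).
g_create, = torch.autograd.grad(y, x, create_graph=True)   # dy/dx = 3x^2 = 27
g2, = torch.autograd.grad(g_create.sum(), x)                # d2y/dx2 = 6x = 18
print(g2)                       # tensor([18.])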
