What exactly does `retain_variables=True` in `loss.backward()` do?

In the doc it says:

retain_variables (bool): If ``True``, buffers necessary for computing
    gradients won't be freed after use. It is only necessary to
    specify ``True`` if you want to differentiate some subgraph multiple
    times (in some cases it will be much more efficient to use
    `autograd.backward`).

One way to understand it is that retain_variables=True keeps all the variables or flags associated with computing gradients, while retain_variables=False throws them away after the backward pass.

However, I am not sure I understand it properly, and I would like to see the difference in code and outputs. So I tried to look into the source code, and below is as far as I could get:

def backward(self, gradient=None, retain_variables=False):
    """Computes the gradient of current variable w.r.t. graph leaves.

    The graph is differentiated using the chain rule. If the variable is
    non-scalar (i.e. its data has more than one element) and requires
    gradient, the function additionally requires specifying ``gradient``.
    It should be a tensor of matching type and location, that contains
    the gradient of the differentiated function w.r.t. ``self``.

    This function accumulates gradients in the leaves - you might need to
    zero them before calling it.

    Arguments:
        gradient (Tensor): Gradient of the differentiated function
            w.r.t. the data. Required only if the data has more than one
            element. Type and location should match these of ``self.data``.
        retain_variables (bool): If ``True``, buffers necessary for computing
            gradients won't be freed after use. It is only necessary to
            specify ``True`` if you want to differentiate some subgraph multiple
            times (in some cases it will be much more efficient to use
            `autograd.backward`).
    """
    if self.volatile:
        raise RuntimeError('calling backward on a volatile variable')
    if gradient is None and self.requires_grad:
        if self.data.numel() != 1:
            raise RuntimeError(
                'backward should be called only on a scalar (i.e. 1-element tensor) '
                'or with gradient w.r.t. the variable')
        gradient = self.data.new().resize_as_(self.data).fill_(1)
->  self._execution_engine.run_backward((self,), (gradient,), retain_variables)
return None
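For reference, the gradient argument described in this docstring is what you pass when backpropagating from a non-scalar output; a minimal sketch using the current tensor API (where Variable has been merged into Tensor):

import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                                # non-scalar output
y.backward(gradient=torch.ones_like(y))  # gradient of some scalar w.r.t. y
print(x.grad)                            # tensor([2., 2., 2.])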

Apparently, to see what exactly retain_variables does in run_backward, I have to go at least one level deeper, but stepping in pdb won't take me there; it just returns None. So I am stuck.

could anyone help me here? Thanks

2 Likes

After loss.backward() you cannot do another loss.backward() unless retain_variables is True.

In plain words, the backward pass will consume (free) the intermediate saved tensors (Variables) used for backpropagation unless you explicitly tell PyTorch to retain them.
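A minimal sketch of the difference, using the modern spelling retain_graph (which replaced retain_variables):

import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x * x).sum()

y.backward(retain_graph=True)  # intermediate buffers are kept alive
y.backward()                   # second pass works because the graph was retained
print(x.grad)                  # tensor([[4., 4.], [4., 4.]]): 2x accumulated twice

z = (x * x).sum()
z.backward()
# z.backward()  # RuntimeError: the buffers have already been freed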

11 Likes

Thanks a lot! Your explanation is helpful.

I still wonder how I can use some code to experiment with the differences made with and without retain_variables.

Could you give me some simple code examples showing the differences?

Or could you show me how to access the source code of self._execution_engine.run_backward?

Thanks a lot!

Usually after a backpropagation you process the next batch, so you don't need the gradients of the previous batch anymore.

Besides debugging, I don't have any scenario in which I need to backprop twice through the same operation graph; I'm sure there are some, though.

The default is to drop the variables/gradients to save memory.

All the source code of PyTorch is on GitHub. From the name, I guess that self._execution_engine.run_backward traverses the operation graph in reverse, calling each node's backward if defined, or using autograd if not.

Each node in the graph has several properties that are defined in the autograd folder of PyTorch.
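For instance, in current PyTorch you can peek at the graph the engine walks in reverse through the grad_fn attribute (a minimal sketch):

import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)                 # <SumBackward0 ...>
print(y.grad_fn.next_functions)  # the MulBackward0 node that feeds it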

3 Likes

Actually, when we train a GAN, we usually calculate D_loss for the discriminator net and G_loss for the generator net. After calculating the gradient of D_loss, we need to retain the variables for the calculation of G_loss, which is a typical example of the retain_variables=True use case.

6 Likes

In my (very limited) experience, you do not compute discriminator and generator gradients for the same forward step because the loss objectives have opposing signs.

Best regards

Thomas

1 Like

Thank you for the reply. I now think my example was a little inaccurate, and you are right.

Thanks for your correction!

I think a concrete case where retain_graph=True is helpful is multi-task learning, where you have different losses at different layers of the network. In order to back-propagate the gradient of each loss w.r.t. the parameters of the network, you need to set retain_graph=True, or you can only do backward for one of the many losses.
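For instance, a minimal sketch with a hypothetical shared trunk and two task heads:

import torch
import torch.nn as nn

trunk = nn.Linear(10, 20)   # shared layers
head1 = nn.Linear(20, 1)    # task 1
head2 = nn.Linear(20, 1)    # task 2

x = torch.randn(4, 10)
h = trunk(x)                # shared intermediate activations
loss1 = head1(h).mean()
loss2 = head2(h).mean()

loss1.backward(retain_graph=True)  # keep the trunk's buffers for the next pass
loss2.backward()                   # second backward through the shared trunk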

8 Likes

Another example where it is useful is shown here:

Here the input image is treated as a variable, and a gradient is calculated to see which parts of the image have the most influence on the global classification decision.
This is useful when you have trained a classifier without any localization and you still want to see some localization information.
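A minimal sketch of that idea, with a hypothetical stand-in classifier; computing saliency for more than one class from a single forward pass is exactly where retain_graph is needed:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # stand-in classifier

img = torch.randn(1, 3, 8, 8, requires_grad=True)  # the input image is the variable
logits = model(img)

logits[0, 0].backward(retain_graph=True)  # saliency w.r.t. class 0 score
g0 = img.grad.abs().clone()

img.grad = None                           # clear before the second pass
logits[0, 1].backward()                   # reuse the same graph for class 1
g1 = img.grad.abs().clone()               # large values mark influential pixels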

1 Like

The question asks about retain_variables, but some answers talk about retain_graph. What is the difference between the two?

It is essentially the same thing: the retain_variables argument has been deprecated in favor of retain_graph.
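So in current PyTorch you would write, for example:

import torch

loss = torch.ones(1, requires_grad=True).sum()
loss.backward(retain_graph=True)  # formerly: loss.backward(retain_variables=True)
loss.backward()                   # allowed because the graph was retained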

3 Likes

I understand that if I have two loss functions in different parts of the network, I'll have to use retain_graph. What if I add both losses and do total_loss.backward()?

For example, rather than

loss1.backward(retain_graph=True)
loss2.backward()
opt.step()

I would just do

total_loss = loss1 + loss2
total_loss.backward()
opt.step()

8 Likes

@devansh20la that is pretty legit and works well.

2 Likes

Hi @smth,
I just started to use PyTorch recently, and I am also confused about the above problem.
I think the above two ways may not be the same?

While using:

loss1.backward(retain_graph=True)
loss2.backward()
opt.step()

the layers between loss1 and loss2 will only calculate gradients from loss2, and the layers before loss1 will calculate gradients as the sum of loss1 + loss2.

But if using:

total_loss = loss1 + loss2
total_loss.backward()
opt.step()

all layers will calculate gradients using loss_value = loss1 + loss2.

What do you think about it?

1 Like

The derivative of loss1 w.r.t. the weights of model 2 would be zero, so the two methods give the same gradients.
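A quick numeric check (a minimal sketch with two hypothetical stacked layers) confirms that the two approaches produce identical gradients:

import torch
import torch.nn as nn

def grads(separate):
    torch.manual_seed(0)                        # identical init and data for both runs
    net1, net2 = nn.Linear(4, 4), nn.Linear(4, 1)
    x = torch.randn(2, 4)
    h = net1(x)
    loss1 = h.pow(2).mean()                     # loss attached after net1
    loss2 = net2(h).mean()                      # loss attached after net2
    if separate:
        loss1.backward(retain_graph=True)
        loss2.backward()
    else:
        (loss1 + loss2).backward()
    params = list(net1.parameters()) + list(net2.parameters())
    return [p.grad.clone() for p in params]

a, b = grads(True), grads(False)
print(all(torch.allclose(g1, g2) for g1, g2 in zip(a, b)))  # True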

1 Like

'loss1.backward(retain_graph=True)' will update model 1's parameters; when we do loss2.backward(), will it update model 1 as well?

What if we need the variables, or that part of the computational graph, of model 1, but we don't want to update its parameters, e.g. in a GAN? Are there any mistakes in my understanding?

Yes, it will update model 1's parameters: gradients from loss2 will flow all the way back to model 1.
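You can check this directly (a minimal sketch with hypothetical model1 and model2):

import torch
import torch.nn as nn

model1, model2 = nn.Linear(3, 3), nn.Linear(3, 1)
x = torch.randn(2, 3)
loss2 = model2(model1(x)).mean()       # loss2 is defined after model2
loss2.backward()
print(model1.weight.grad is not None)  # True: loss2's gradient reached model1

If you do not want model 1 to be updated (e.g. the GAN case above), a common approach is to detach model 1's output before feeding it to model 2.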

So what's the difference between these two methods?
I don't know how loss1 + loss2 does its backward.
Is there a difference? What is it?

As far as I understand, loss = loss1 + loss2 will compute grads for all params; for params used in both loss1 and loss2, it sums the grads when backward() is called.
Meanwhile, loss1.backward() and loss2.backward() compute the grads of loss1 and loss2 separately.

The difference shows up in the optimizer step: sum(loss).backward() uses one optimizer with the sum of the grads, while separate loss1 and loss2 can use one optimizer (or even two optimizers) and step between the backward calls; for params shared by loss1 and loss2, the step size may then depend on loss2 (which is computed later).

1 Like