```text
retain_variables (bool): If ``True``, buffers necessary for computing
    gradients won't be freed after use. It is only necessary to
    specify ``True`` if you want to differentiate some subgraph multiple
    times (in some cases it will be much more efficient to use
    `autograd.backward`).
```
One way to understand it is that `retain_variables=True` means "keep all the intermediate buffers associated with computing gradients", while `retain_variables=False` lets those values be freed after use.
However, I am not sure I understand it properly, and I would like to see the difference in code and outputs. So I tried to look into the source code, and here is as far as I can go:
```python
def backward(self, gradient=None, retain_variables=False):
    """Computes the gradient of current variable w.r.t. graph leaves.

    The graph is differentiated using the chain rule. If the variable is
    non-scalar (i.e. its data has more than one element) and requires
    gradient, the function additionally requires specifying ``gradient``.
    It should be a tensor of matching type and location, that contains
    the gradient of the differentiated function w.r.t. ``self``.

    This function accumulates gradients in the leaves - you might need
    to zero them before calling it.

    Arguments:
        gradient (Tensor): Gradient of the differentiated function
            w.r.t. the data. Required only if the data has more than one
            element. Type and location should match these of ``self.data``.
        retain_variables (bool): If ``True``, buffers necessary for
            computing gradients won't be freed after use. It is only
            necessary to specify ``True`` if you want to differentiate
            some subgraph multiple times (in some cases it will be much
            more efficient to use `autograd.backward`).
    """
    if self.volatile:
        raise RuntimeError('calling backward on a volatile variable')
    if gradient is None and self.requires_grad:
        if self.data.numel() != 1:
            raise RuntimeError(
                'backward should be called only on a scalar (i.e. 1-element '
                'tensor) or with gradient w.r.t. the variable')
        gradient = self.data.new().resize_as_(self.data).fill_(1)
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)  # <- pdb stops here
```
Apparently, to see what exactly retain_variables does in run_backward, I have to go at least one level deeper, but stepping in pdb won't take me there; it just returns None. So I am stuck.
After loss.backward() you cannot do another loss.backward() unless retain_variables is true.
In plain words, the backward pass will consume the intermediate saved tensors (Variables) used for backpropagation unless you explicitly tell PyTorch to retain them.
Usually after a backpropagation you process the next batch, so you don't need the gradients of the previous batch anymore.
Besides debugging, I don't have any scenario in which I need to backprop twice through the same operation graph; I'm sure there are some, though.
The default is to drop the variables/gradients to save memory.
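To see the difference in code and output, here is a minimal sketch (assuming a recent PyTorch, where `retain_variables` has been renamed `retain_graph`; the shapes and values are arbitrary):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = (x * x).sum()  # the multiply saves x for the backward pass

y.backward(retain_graph=True)  # buffers are kept, so a second pass is allowed
y.backward()                   # this pass frees the graph
# y.backward()  # would raise: RuntimeError: Trying to backward through the graph a second time
print(x.grad)                  # gradients accumulated over both passes: tensor([4., 4., 4.])
```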
All the source code of PyTorch is on GitHub. From the name, I guess that self._execution_engine.run_backward traverses the operation graph in reverse, calling backward if it is defined, or falling back to autograd if not.
Each node in the graph has several properties that are defined in the autograd folder of PyTorch.
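You can inspect those node properties from Python, e.g. (the exact grad_fn class names are an implementation detail and vary by version):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)                 # e.g. <SumBackward0 ...>: the node that will run for sum
print(y.grad_fn.next_functions)  # edges to earlier nodes, e.g. ((<MulBackward0 ...>, 0),)
```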
Actually, when we train a GAN, we usually compute D_loss for the discriminator net and G_loss for the generator net. After computing the gradient of D_loss, we need to retain the variables for the computation of G_loss, which is a typical example of the retain_variables=True use (see the sketch below).
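A minimal sketch of that pattern, with stand-in nets (whether both losses should come from the same forward pass depends on your training scheme, as the next reply points out):

```python
import torch
import torch.nn as nn

# hypothetical stand-ins for the generator and discriminator
G = nn.Linear(8, 8)
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())

noise = torch.randn(4, 8)
d_on_fake = D(G(noise))

d_loss = -torch.log(1.0 - d_on_fake).mean()  # discriminator wants fakes scored low
g_loss = torch.log(1.0 - d_on_fake).mean()   # generator wants the opposite

d_loss.backward(retain_graph=True)  # keep the shared graph alive for the next pass
g_loss.backward()                   # without retain_graph above, this would raise
```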
In my (very limited) experience, you do not compute discriminator and generator gradients for the same forward step because the loss objectives have opposing signs.
I think a concrete case where retain_graph=True is helpful is multi-task learning, where you have different losses at different layers of the network. In order to back-propagate the gradient of each loss w.r.t. the parameters of the network, you need to set retain_graph=True, or you can only do backward for one of the many losses.
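For example, a minimal sketch with two hypothetical heads at different depths of a shared trunk:

```python
import torch
import torch.nn as nn

# hypothetical multi-task model: two heads at different depths of a shared trunk
trunk = nn.Linear(10, 10)
head1 = nn.Linear(10, 1)    # loss attached early
deeper = nn.Linear(10, 10)
head2 = nn.Linear(10, 1)    # loss attached later

x = torch.randn(4, 10)
h = torch.relu(trunk(x))
loss1 = head1(h).mean()
loss2 = head2(torch.relu(deeper(h))).mean()

loss1.backward(retain_graph=True)  # keep the trunk's saved tensors
loss2.backward()                   # second backward through the shared trunk
```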
Here the input image is treated as a variable, and a gradient is calculated to see which parts of the image have the most influence on the global classification decision.
This is useful when you have trained a classifier without any localization and you still want to see some localization information.
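A rough sketch of that idea, with a stand-in classifier (the model, shapes, and the saliency reduction are all illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier

img = torch.randn(1, 3, 32, 32, requires_grad=True)  # the input image is the variable
scores = model(img)
scores[0, scores.argmax()].backward()  # gradient of the winning class w.r.t. the pixels

saliency = img.grad.abs().amax(dim=1)  # per-pixel influence map, shape (1, 32, 32)
```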
I understand that if I have two loss functions in different parts of the network, I'll have to use retain_graph. What if I add both losses and do total_loss.backward()?
Hi @smth,
I just started to use PyTorch recently, and I am also confused about the above problem.
I think the two ways above may not be the same?
While using:

```python
loss1.backward(retain_graph=True)
loss2.backward()
opt.step()
```

the layers between loss1 and loss2 will only calculate gradients from loss2, and the layers before loss1 will calculate gradients as the sum of loss1 + loss2.
But if using:

```python
total_loss = loss1 + loss2
total_loss.backward()
opt.step()
```

all layers will calculate gradients using loss_value = loss1 + loss2.
Sometimes we need the variables, or part of the computational graph, of model-1, but we won't update its parameters, e.g. in a GAN. Are there any mistakes in my understanding?
As far as I can tell, loss = loss1 + loss2 will compute grads for all params; for params used in both l1 and l2, it sums the grads, and then backward() produces that total grad.
Meanwhile, loss1.backward() and loss2.backward() separately compute the grads of loss1 and loss2, each producing its own grads.
The difference is: when the optimizer updates with a gradient-descent method, sum(loss).backward() uses only one optimizer step with the sum of the grads, while with l1 and l2 you can use one optimizer (or even two optimizers) to compute step sizes based on each set of grads; for params that l1 and l2 share, the step size may then depend on the l2 loss (which is computed later).
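A tiny check of the "it sums the grads" claim (shapes are arbitrary; retain_graph is only needed so the same graphs can be reused for the comparison):

```python
import torch

w = torch.ones(3, requires_grad=True)
loss1 = (w * 2).sum()
loss2 = (w * 3).sum()

(loss1 + loss2).backward(retain_graph=True)
g_sum = w.grad.clone()

w.grad = None                      # reset the accumulated gradient
loss1.backward(retain_graph=True)  # first pass
loss2.backward()                   # second pass accumulates into w.grad
print(torch.equal(g_sum, w.grad))  # True: both ways give the same total grad
```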