Why do we need to set the gradients manually to zero in PyTorch?

justusschock · February 25, 2019, 8:18am

If you don’t specify other behavior (retain_graph=True or something like that) you are right.

Weifeng · February 25, 2019, 9:51am

As @justusschock said, the memory consumption is the same if we neglect overhead, but the graph structures are different. The illustrations make this clearer:

[Image: untitled file]

[Image: untitled file (1)]

AFAIK, graphs cannot be explicitly manipulated. Only indirect ways like backward and name scoping are available.

ForeverZH0204 · February 25, 2019, 10:48am

Thank you!
But why do you mention

graphs cannot be explicitly manipulated

What is explicit manipulation, or what do you mean by that?

Weifeng · February 25, 2019, 12:48pm

Sorry for the confusion. I just wanted to say that the computation graph is created and freed automatically, behind the operations.
Understanding what happens to the graph is helpful.

Maybe it would be a good idea to add a graph semantics section to the documentation, elaborating on the various cases involved.

heisenburgzero · February 27, 2019, 6:18pm

Do I only need to divide by iter_size if the loss function takes the average?
Let’s say I’m doing a sum of squared errors: should I call backward() without dividing the loss by iter_size?

Also, do I need to worry about batch normalization in this case if I don’t divide by iter_size?
Thanks!

mahesh_bhosale · April 25, 2019, 11:14am

Do you mean we should not zero out gradients for RNN? I thought the cell state would be doing most of the work of remembering the information.

blackbirdbarber · July 5, 2019, 7:49am

This looks like the main reason why the design decision was made not to remove the gradients.
What I was trying to understand is: isn’t the best time to remove the gradients right after the optimizer step?

The idea is that we don’t need to track when a certain function ends; we just track the optimizer.
However, I am not sure this logic fits all use cases, such as GANs, so I would like to hear opinions.

I actually think zeroing gradients can be a design decision (with the default being to zero them).

Jaideep_Valani · July 14, 2019, 6:37am

Hi alban,
Thanks for your input above.
I am doing gradient accumulation this way. Could you please let me know if there is any problem here?

total_loss = 0.
learn.model.train()
for i, (xb, yb) in enumerate(learn.data.train_dl):
    loss = learn.loss_func(learn.model(xb), yb)
    if (i + 1) % 2 == 0:  # doing grad accumulation: update on every second mini-batch
        loss = total_loss * 0.9 + loss * 0.1  # some loss smoothing
        loss.backward()
        learn.opt.step()
        learn.opt.zero_grad()
        total_loss = 0.
    else:
        total_loss += loss  # accumulate the loss (this keeps the batch's graph alive)

albanD · July 22, 2019, 8:24am

Hi,

This looks good to me!
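
For comparison, here is a minimal sketch of the more common accumulation pattern, where backward() is called on every mini-batch and only the optimizer step is delayed. It is not the same weighting as the 0.9/0.1 smoothing above; it simply averages over accum_steps mini-batches. The toy model, the random batches, and the names accum_steps and num_updates are assumptions made up for illustration (they stand in for learn.model and learn.data.train_dl).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a small model and random mini-batches.
model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.MSELoss()
batches = [(torch.randn(4, 3), torch.randn(4, 1)) for _ in range(6)]

accum_steps = 2  # hypothetical name: mini-batches per optimizer step
num_updates = 0

model.train()
opt.zero_grad()
for i, (xb, yb) in enumerate(batches):
    # Scale so the accumulated gradient is the average over accum_steps batches.
    loss = loss_func(model(xb), yb) / accum_steps
    # Gradients are added into .grad; this batch's graph is freed right here.
    loss.backward()
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
        num_updates += 1
```

Because each graph is freed as soon as its backward() call returns, this variant does not need to keep the previous mini-batch's graph in memory.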

albanD · July 22, 2019, 8:27am

Hi,

It depends a lot on your use case. Sometimes you want to keep some gradients longer (to compute some statistics?).
But I agree that this is just a design decision. Unfortunately, the original decision was not to zero them, and I don’t think we can change it now (for backward compatibility reasons).

pinocchio · October 12, 2019, 11:58pm

What if we don’t have an optimizer, just the model?
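
For reference, nn.Module has its own zero_grad() method, so the gradients can be reset without an optimizer. A minimal sketch, assuming a toy nn.Linear model; the helper is_cleared is made up for illustration, and it accepts both None and all-zero gradients because which of the two zero_grad() leaves behind depends on the PyTorch version:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)

# One backward pass so that every parameter has a gradient.
model(torch.ones(1, 2)).sum().backward()
grads_before = [p.grad.clone() for p in model.parameters()]

# Reset the gradients of all parameters of the module, no optimizer involved.
model.zero_grad()

def is_cleared(p):
    # Hypothetical helper: the gradient is either None or filled with zeros.
    return p.grad is None or bool((p.grad == 0).all())

cleared = all(is_cleared(p) for p in model.parameters())
```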

pinocchio · October 13, 2019, 12:00am

Is the assumption behind gradient accumulation that each backward pass contributes gradients from another loss function (or something similar), and that we add them together?

I don’t understand why accumulation is the right operation for multiple backward passes.

pinocchio · October 13, 2019, 12:09am

Why do we backpropagate through an RNN several times, which is what makes accumulation justifiable? Why accumulation?

mingtop_jamin · November 20, 2019, 3:07am

Does total_loss need to be divided by 64 and by the batch size?

In my opinion, dividing by the batch size is not needed, since the loss function has already taken the mean for you, right?

mingtop_jamin · November 20, 2019, 3:18am

I think we don’t need to divide by the batch size and item_size in the loss1 += loss_2 case, as each loss is backpropagated along its own graph path. The sum of losses here is not taken over iter_num. Right?

cite_me_please · December 25, 2019, 8:41am

Thanks for your solution! I have a small question about your 2nd and 3rd methods.
For example, the batch size is set to 10, and I want the learning rate to be 1e-3, so do I have to manually set the learning rate to 1e-3 / batch_size (1e-4 in this case)? Thanks!

albanD · December 26, 2019, 6:20pm

It depends on what your loss is.
If it should be the average of the loss of all samples, you want to divide the final loss by the number of mini-batches you used.
If it should be the sum, then they will be the same as 1) already.
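
A minimal sketch of both cases, assuming a toy linear model, MSE loss, and equally sized mini-batches. The helper names accumulated_grad and full_batch_grad are made up for illustration, and iter_size is borrowed from the question earlier in the thread:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup (assumption): 8 samples split into iter_size equal mini-batches.
model = nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)
iter_size = 4

def accumulated_grad(reduction, divide):
    """Gradient obtained by calling backward() once per mini-batch."""
    model.zero_grad()
    loss_fn = nn.MSELoss(reduction=reduction)
    for xb, yb in zip(x.chunk(iter_size), y.chunk(iter_size)):
        loss = loss_fn(model(xb), yb)
        if divide:
            loss = loss / iter_size
        loss.backward()  # adds into model.weight.grad
    return model.weight.grad.clone()

def full_batch_grad(reduction):
    """Gradient obtained from a single backward() over all samples."""
    model.zero_grad()
    nn.MSELoss(reduction=reduction)(model(x), y).backward()
    return model.weight.grad.clone()

# Averaged loss: divide by the number of mini-batches to match the full batch.
mean_matches = torch.allclose(
    accumulated_grad("mean", divide=True), full_batch_grad("mean"), atol=1e-5
)
# Summed loss: the accumulated gradient already matches, no division needed.
sum_matches = torch.allclose(
    accumulated_grad("sum", divide=False), full_batch_grad("sum"), atol=1e-5
)
```

Note that the averaged case only lines up exactly when all mini-batches have the same size.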

pinocchio · January 7, 2020, 10:04pm

What do you mean by “RNNs are back propagated several times”? I’ve always considered RNNs to be really just a long feedforward net, so we only backprop through it once, like any other model…

pinocchio · January 7, 2020, 10:06pm

Not sure what you mean by “accumulates gradients”. Calling backward twice is not even allowed unless you explicitly ask for it. See this trivial code and the resulting error:

import torch

w = torch.tensor([4.0], requires_grad=True)

l = (w - 1.0)**2
l.backward()

print('---1st backward call---')
print(w.grad)
print()

print('---2nd backward call---')
l.backward()

Output:

python backward_pass_twice.py
---1st backward call---
tensor([6.])

---2nd backward call---
Traceback (most recent call last):
  File "backward_pass_twice.py", line 13, in <module>
    l.backward()
  File "/Users/me/miniconda3/envs/ml/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/me/miniconda3/envs/ml/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
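
For context, “accumulates gradients” describes what happens to .grad when backward() runs more than once on a graph that still exists. A minimal sketch, assuming either a fresh forward pass before each call or retain_graph=True on the first call; the names w2, l2 and the grad_* variables are made up for illustration:

```python
import torch

w = torch.tensor([4.0], requires_grad=True)

# Case 1: recompute the forward pass before each backward call,
# so every call gets a fresh graph. The results are added into w.grad.
for _ in range(2):
    l = (w - 1.0) ** 2
    l.backward()
grad_after_two_calls = w.grad.clone()  # 6 + 6

# Resetting by hand removes the contribution of the earlier calls.
w.grad.zero_()
l = (w - 1.0) ** 2
l.backward()
grad_after_reset = w.grad.clone()  # 6 again

# Case 2: keep the graph alive on the first call, then reuse it.
w2 = torch.tensor([4.0], requires_grad=True)
l2 = (w2 - 1.0) ** 2
l2.backward(retain_graph=True)
l2.backward()
grad_with_retain = w2.grad.clone()  # also 6 + 6
```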

pinocchio · January 7, 2020, 10:13pm

albanD:

Hi,

I think the big difference with tensorflow is the following.
Since you use a static graph, you define exactly what should be done to make one gradient computation/update. And then you just tell it to do it using a given input/target.

In pytorch, it is significantly more flexible as the autograd engine will just “remember” how to compute the gradient for a given variable while you are performing computations with this Variable. This means that you can get the gradients wrt a variable, then perform computation with it again, then recompute gradients corresponding to these new operations.
In this scheme, there is not a single point where you stop performing “forward” operations and you know that the only thing that is left to be done is compute the gradients. So it is trickier to automatically set the gradients to 0 because you don’t know when a computation ends, and when a new one starts.

Gradient accumulation is useful, for example, if you share some part of a network between two different tasks:
input = Variable(data)
# Get the features
features = feature_extractor(input)

# Compute first loss and get the gradients for it
loss1 = task1(features)
loss1.backward(retain_graph=True)
# This adds the gradients wrt loss1 in both the "task1" net and the "feature_extractor" net
# So each parameter "w" in "feature_extractor" has its gradient d(loss1)/dw

# Perform the second task and get the gradients for it as well
loss2 = task2(features)
loss2.backward()
# This will add gradients in "task2" and accumulate in "feature_extractor"
# Now each parameter in "feature_extractor" contains d(loss1)/dw + d(loss2)/dw
So the fact that the gradients are accumulated allows you to get the correct gradient for all the computations that you do with a given Variable even if you use it at multiple places in convoluted ways.
The drawback here is that you have to manually reset the values to 0 so that the gradients computed previously do not interfere with the ones you are currently computing.

Excellent example! Thank you.

But why did you not need to call retain_graph=True?
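
A runnable sketch of the quoted example may help here. In it, retain_graph=True is passed on the first backward() so that the part of the graph shared through features survives for the second call; the last backward() does not need it because nothing runs through that graph afterwards. The small modules, the two loss formulas, and the function names below are assumptions made up for illustration, standing in for feature_extractor, task1 and task2 from the quote:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the networks named in the quoted post.
feature_extractor = nn.Sequential(nn.Linear(4, 3), nn.Tanh())
task1_head = nn.Linear(3, 1)
task2_head = nn.Linear(3, 1)
data = torch.randn(8, 4)

def losses():
    # Both losses share the same features, hence the same part of the graph.
    features = feature_extractor(data)
    loss1 = task1_head(features).pow(2).mean()
    loss2 = task2_head(features).pow(2).sum()
    return loss1, loss2

def grads_from_two_backward_calls():
    feature_extractor.zero_grad()
    loss1, loss2 = losses()
    loss1.backward(retain_graph=True)  # keep the shared graph for the next call
    loss2.backward()                   # last use of the graph, so it may be freed
    return [p.grad.clone() for p in feature_extractor.parameters()]

def grads_from_summed_loss():
    feature_extractor.zero_grad()
    loss1, loss2 = losses()
    (loss1 + loss2).backward()
    return [p.grad.clone() for p in feature_extractor.parameters()]

accumulated = grads_from_two_backward_calls()
summed = grads_from_summed_loss()
# Accumulation gives d(loss1)/dw + d(loss2)/dw, the gradient of loss1 + loss2.
same = all(torch.allclose(a, s, atol=1e-6) for a, s in zip(accumulated, summed))
```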