Why do we need to set the gradients manually to zero in pytorch?

To be able to call .backward() twice.
That is also the answer to your question above: calling backward twice is allowed if you specify retain_graph=True in the first call.

Darn it, you're right. I didn't see that your first call to backward had that flag. Oops!

loss.backward(retain_graph=True) # needed to call backward again

So if retain_graph=False, then after the first backward the graph will be freed, and calling .backward() a second time would cause an error, am I correct?

Yes you are correct.

Note that in some cases, you won't see the error because nothing was saved, and so nothing was freed :wink: But you should not rely on that.
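
To make that concrete, here is a minimal sketch (my own example, not from the thread; the exact error message can vary between versions):

    import torch

    x = torch.randn(3, requires_grad=True)

    loss = (x ** 2).sum()
    loss.backward(retain_graph=True)  # saved tensors are kept, so a second backward is allowed
    loss.backward()                   # works, and accumulates into x.grad

    loss = (x ** 2).sum()
    loss.backward()                   # saved tensors are freed here
    # loss.backward()                 # RuntimeError: trying to backward through the graph a second time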

Thanks for the answer. I found the loss would increase without optimizer.zero_grad().

Link: https://cs231n.github.io/optimization-2/
The forward expression involves the variables x,y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the multivariable chain rule in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add.
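
As a tiny illustration of that quote (my own example): when a variable feeds into two branches, the gradients flowing back from both branches are summed into its .grad.

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    a = 2 * x          # da/dx = 2
    b = x * x          # db/dx = 2 * x = 6
    (a + b).backward()
    print(x.grad)      # tensor(8.) -- the two contributions are added, not overwritten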

Just to double check.

It only accumulates; it does not divide by the total number of elements it accumulated, right?

(I loop over the batch because it's so big and special.)

Yes, the backward pass only accumulates into the existing .grad when it exists.
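
A small sketch of what that means in practice (my own example): repeated backward calls just sum into .grad, with no averaging, so if you want a mean over accumulation steps you have to scale the loss yourself.

    import torch

    w = torch.ones(2, requires_grad=True)
    for _ in range(2):
        loss = (w * 3).sum()
        loss.backward()      # adds 3 to each entry of w.grad on every call

    print(w.grad)            # tensor([6., 6.]) -- accumulated, not divided by 2
    w.grad.zero_()           # this is why you call optimizer.zero_grad() between updates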

Hi,

I have extra question about BN layer here.

So if we want to use variable-sized inputs and do this kind of gradient accumulation. Let's say the 'real batch size' we want is 32, so we iterate 32 times and then do an optim.step(). My question is: compared to the normal way (batch size of 32 and an optim.step() every iteration), how will Batch Normalization affect the results, since in the forward() pass we only feed in a batch of size 1?

Batch normalization with inputs of size 1 during training might not behave very well.
It usually expects a batch of several samples to be able to do the normalization properly.
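
For example (my own snippet, behaviour of recent PyTorch versions; the exact error text may differ): BatchNorm1d refuses a training-mode batch of a single sample outright, and BatchNorm2d accepts it but computes its statistics from that single image only.

    import torch
    import torch.nn as nn

    bn1d = nn.BatchNorm1d(4)
    x = torch.randn(1, 4)
    # bn1d(x)  # ValueError: Expected more than 1 value per channel when training

    bn2d = nn.BatchNorm2d(4)
    img = torch.randn(1, 4, 8, 8)
    out = bn2d(img)  # runs, but the batch statistics come from this one image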

Thanks for prompt reply!

So if we want to feed variable-sized images to our network using the gradient-accumulation method above, how should we deal with the batchnorm layers in the original model? Should we just delete the BN layers, or train with the model in eval() mode (I have seen someone do this for their training so that the batch statistics are not used in the forward() pass, see the sketch below)?
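
(For reference, a rough sketch of that trick, using a toy model just for illustration: put only the BN modules in eval mode so they use their running statistics instead of statistics from a size-1 batch.)

    import torch.nn as nn

    def freeze_batchnorm(model: nn.Module) -> None:
        # put only the BatchNorm layers in eval mode; everything else keeps training
        for m in model.modules():
            if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                m.eval()

    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
    model.train()            # training mode for the whole model
    freeze_batchnorm(model)  # re-apply after every call to model.train()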

And let's say we abandon all the BN layers of an existing model during training; how will this affect the model's performance?

Thanks!

Unfortunately, I can't think of a way to still use batchnorm, as it needs the values from the whole batch to compute its statistics.
The effect on final performance will depend a lot on your model/application/dataset; you'll have to try it in your context and see how it works.

Thanks. I will do that.

@albanD in very recent posts about gradient accumulation I read that it only has benefits if you also align the batch norm statistics accordingly.
But I couldn't figure out what needs to be changed for batch norm if we accumulate more gradients:

  1. the forward-pass activation calculation, i.e. changing /N to /(number of steps x N),
  2. the backward pass with a similar change,
  or anything else?

I am not getting a good answer about this in any forum so far.

I’m not sure what you mean by “align statistics accordingly”. My only answer here is that it won’t do the same thing as if you had a single batch. But I don’t know how this should be updated to work with batchnorm.

What do you do if the number of samples (size of the dataset) is not a multiple of the number of accumulation steps? Let's say the dataset has three samples and we accumulate every two steps. That means the last sample never contributes to an optimizer step. Depending on the size of the accumulation w.r.t. the size of the dataset, that can be problematic. How does one typically deal with that?

Hi,

I would say that depends a lot on your use case. In the only ones I've seen so far, the dataset size was much larger than the accumulation size, so the partial accumulations are just dropped and those samples will most likely be used in the next epoch.
In your case, you might want to make the accumulation roll over to the next epoch?
Note that, as long as you skip the iterations you know you won't be able to finish and you randomly sample your dataset, it won't actually be slower to discard partial accumulations; you will just run each epoch faster.
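
A rough sketch of the roll-over idea (my own toy example, all names are placeholders): count steps globally instead of per epoch, so a partial accumulation at the end of one epoch is simply completed at the start of the next.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()
    dataset = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(7)]  # 7 samples, not a multiple of 4

    accum_steps = 4
    global_step = 0
    for epoch in range(2):
        for data, target in dataset:
            loss = criterion(model(data), target) / accum_steps  # scale so the sum matches a full batch
            loss.backward()                                      # gradients keep accumulating across the epoch boundary
            global_step += 1
            if global_step % accum_steps == 0:                   # step whenever a full group is complete
                optimizer.step()
                optimizer.zero_grad()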

Thanks for the quick reply! I do not actually have this problem, but I was wondering about it from a theoretical perspective. "Rolling over" is indeed what I figured would be best, or using a randomized "infinite" dataloader where you simply do not count in epochs but in steps.

Thanks a lot! I could not understand what accumulating the gradient meant.

Hi, I am doing gradient accumulation this way, but I found that the loss increases.

    for batch_step, data in enumerate(dataloader):
        # ... forward pass on `data` producing `loss` for this mini-batch ...
        g_loss = loss / step              # scale so the accumulated gradient matches a full batch
        g_loss.backward()                 # gradients accumulate in .grad
        if (batch_step + 1) % step == 0:
            optimizerG.step()             # update once every `step` mini-batches
            optimizerG.zero_grad()        # reset the accumulated gradients

But when I change it to normal training like this, everything is OK. I don't know why, please help!!

    for batch_step, data in enumerate(dataloader):
        optimizerG.zero_grad()
        # ... forward pass on `data` producing `loss` for this mini-batch ...
        g_loss = loss
        g_loss.backward()
        optimizerG.step()