Why do we need to set the gradients manually to zero in pytorch?

To be able to call .backward() twice.
That is also the answer to your question above: calling backward twice is allowed if you specify retain_graph=True in the first call.

Darn it, you're right. I didn't see that your first call to backward had that flag. Oops!

loss.backward(retain_graph=True) # needed to call backward again

So if retain_graph=False, then after the first backward the graph will be freed, and calling .backward() a second time would cause an error, am I correct?

Yes you are correct.

Note that in some cases, you won't see the error because nothing was saved, and so nothing was freed :wink: But you should not rely on that.
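
To make that concrete, here is a minimal sketch (my own example, not from the thread; the exact error message can vary between versions):

    import torch

    x = torch.randn(3, requires_grad=True)

    loss = (x ** 2).sum()
    loss.backward(retain_graph=True)  # saved tensors are kept, so a second backward is allowed
    loss.backward()                   # works, and accumulates into x.grad

    loss = (x ** 2).sum()
    loss.backward()                   # saved tensors are freed here
    # loss.backward()                 # RuntimeError: trying to backward through the graph a second time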

Thanks for the answer. I found the loss would increase without optimizer.zero_grad().

Link: https://cs231n.github.io/optimization-2/
The forward expression involves the variables x,y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the multivariable chain rule in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add.
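
As a tiny illustration of that quote (my own example): when a variable feeds into two branches, the gradients flowing back from both branches are summed into its .grad.

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    a = 2 * x          # da/dx = 2
    b = x * x          # db/dx = 2 * x = 6
    (a + b).backward()
    print(x.grad)      # tensor(8.) -- the two contributions are added, not overwritten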

Just to double check.

It only accumulates; it does not divide by the total number of elements it accumulated, right?

(I loop over the batch because it's so big and special.)

Yes, the backward pass only accumulates into the existing .grad when it exists.
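
A small sketch of what that means in practice (my own example): repeated backward calls just sum into .grad, with no averaging, so if you want a mean over accumulation steps you have to scale the loss yourself.

    import torch

    w = torch.ones(2, requires_grad=True)
    for _ in range(2):
        loss = (w * 3).sum()
        loss.backward()      # adds 3 to each entry of w.grad on every call

    print(w.grad)            # tensor([6., 6.]) -- accumulated, not divided by 2
    w.grad.zero_()           # this is why you call optimizer.zero_grad() between updates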

Hi,

I have extra question about BN layer here.

So if we want to use variable-sized inputs and do this kind of gradient accumulation. Let's say the 'real batch size' we want is 32, so we iterate 32 times and then do an optim.step(). My question is: compared to the normal way (batch size of 32 and an optim.step() every iteration), how will Batch Normalization affect the results, since in the forward() pass we only feed in a batch of size 1?

Batch normalization with inputs of size 1 during training might not behave very well.
It usually expects a batch of several samples to be able to do the normalization properly.
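
For example (my own snippet, behaviour of recent PyTorch versions; the exact error text may differ): BatchNorm1d refuses a training-mode batch of a single sample outright, and BatchNorm2d accepts it but computes its statistics from that single image only.

    import torch
    import torch.nn as nn

    bn1d = nn.BatchNorm1d(4)
    x = torch.randn(1, 4)
    # bn1d(x)  # ValueError: Expected more than 1 value per channel when training

    bn2d = nn.BatchNorm2d(4)
    img = torch.randn(1, 4, 8, 8)
    out = bn2d(img)  # runs, but the batch statistics come from this one image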

Thanks for prompt reply!

So if we want to feed variable-sized images to our network using the gradient-accumulation method above, how should we deal with the batchnorm layers in the original model? Should we just delete the BN layers, or train with the model in eval() mode (I have seen someone do this for their training so that the batch statistics are not used in the forward() pass, see the sketch below)?
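
(For reference, a rough sketch of that trick, using a toy model just for illustration: put only the BN modules in eval mode so they use their running statistics instead of statistics from a size-1 batch.)

    import torch.nn as nn

    def freeze_batchnorm(model: nn.Module) -> None:
        # put only the BatchNorm layers in eval mode; everything else keeps training
        for m in model.modules():
            if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                m.eval()

    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
    model.train()            # training mode for the whole model
    freeze_batchnorm(model)  # re-apply after every call to model.train()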

And let's say we abandon all the BN layers of an existing model during training; how will this affect the model's performance?

Thanks!

Unfortunately, I can't think of a way to still use batchnorm, as it needs the values from the whole batch to compute its statistics.
The effect on final performance will depend a lot on your model/application/dataset; you'll have to try it in your context and see how it works.

Thanks. I will do that.

@albanD in very recent posts about gradient accumulation I read that it only has benefits if you also align the batch norm statistics accordingly.
But I couldn't figure out what needs to be changed for batch norm if we accumulate more gradients:

  1. the forward-pass activation calculation, i.e. changing /N to /(number of steps x N),
  2. the backward pass with a similar change,
  or anything else?

I am not getting a good answer about this in any forum so far.

I’m not sure what you mean by “align statistics accordingly”. My only answer here is that it won’t do the same thing as if you had a single batch. But I don’t know how this should be updated to work with batchnorm.

What do you do if the number of samples (size of the dataset) is not a multiple of the number of accumulation steps? Let's say the dataset has three samples and we accumulate every two steps. That means the last sample never contributes to an optimizer step. Depending on the size of the accumulation w.r.t. the size of the dataset, that can be problematic. How does one typically deal with that?

Hi,

I would say that depends a lot on your use case. In the only ones I've seen so far, the dataset size was much larger than the accumulation size, so the partial accumulations are just dropped and those samples will most likely be used in the next epoch.
In your case, you might want to make the accumulation roll over to the next epoch?
Note that, as long as you skip the iterations you know you won't be able to finish and you randomly sample your dataset, it won't actually be slower to discard partial accumulations; you will just run each epoch faster.
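
A rough sketch of the roll-over idea (my own toy example, all names are placeholders): count steps globally instead of per epoch, so a partial accumulation at the end of one epoch is simply completed at the start of the next.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()
    dataset = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(7)]  # 7 samples, not a multiple of 4

    accum_steps = 4
    global_step = 0
    for epoch in range(2):
        for data, target in dataset:
            loss = criterion(model(data), target) / accum_steps  # scale so the sum matches a full batch
            loss.backward()                                      # gradients keep accumulating across the epoch boundary
            global_step += 1
            if global_step % accum_steps == 0:                   # step whenever a full group is complete
                optimizer.step()
                optimizer.zero_grad()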

Thanks for the quick reply! I do not actually have this problem, but I was wondering about it from a theoretical perspective. "Rolling over" is indeed what I figured would be best, or using a randomized "infinite" dataloader where you simply do not count in epochs but in steps.

Thanks a lot! I could not understand what accumulating the gradient meant.

Hi, I am doing gradient accumulation this way, but I found that the loss increases.

    for batch_step, data in enumerate(dataloader):
        # ... forward pass on `data` producing `loss` for this mini-batch ...
        g_loss = loss / step              # scale so the accumulated gradient matches a full batch
        g_loss.backward()                 # gradients accumulate in .grad
        if (batch_step + 1) % step == 0:
            optimizerG.step()             # update once every `step` mini-batches
            optimizerG.zero_grad()        # reset the accumulated gradients

But when I change it to normal training like this, everything is OK. I don't know why, please help!!

    for batch_step, data in enumerate(dataloader):
        optimizerG.zero_grad()
        # ... forward pass on `data` producing `loss` for this mini-batch ...
        g_loss = loss
        g_loss.backward()
        optimizerG.step()