How to implement accumulated gradient?

Hi Mori,

Sorry, I didn’t figure it out either. I just used the code snippet above to train my model. The model converges, but the accuracy is 2% lower than with my Caffe code. I don’t know whether this is a framework difference or whether something in my implementation is wrong. Maybe we should wait for their answers. @apaszke @smth . Thank you.

this code looks good. the lower accuracy must be because of some other subtle reason.

Hi smth,

So, in the end, is there no need to divide the loss by iter_size?
I’m still a bit confused since apaszke mentioned dividing here.

Thank you for your help.

dividing loss by iter_size might be the subtle reason :slight_smile:
I just meant that the code didn’t have any glaring errors.

@MORI @smth Yeah, that may be the subtle reason. Thanks for pointing it out.

But actually, I don’t really understand. From my perspective, the loss is calculated for each mini-batch of samples. The gradients are accumulated if we don’t reset them, but the loss is not accumulated. Is my understanding correct? Or is the loss also summed over all the iterations inside the for loop?

So if the loss is not accumulated, why do we have to divide it by the iter_size? Thank you so much if you can explain more to help. :smiley:

Note that, keeping the learning rate constant, it is important to feed the optimizer the same gradients before and after using this trick. If we don’t use the accumulation trick, we would be computing the gradient like this:

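# Variant 1: sum the losses of all small batches, then backpropagate once.
# `output` and `target_var` stand for the current small batch's predictions and targets.
minibatch_size = old_batch_size // iter_size
loss = 0
for i in range(iter_size):
    # output here has the size of the small mini-batch
    loss += criterion(output, target_var)
loss = loss / iter_size
loss.backward()  # single backward pass through the graph of all iter_size batches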

But when we use this trick, we need to make sure that the mean of the accumulated gradients is the same as before.
So we divide the loss by iter_size every time, such that after summing up, the gradients come out the same.

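loss_sum = 0
for i in range(iter_size):
    # divide so that the accumulated gradients average over the full batch
    loss = criterion(output, target_var) / iter_size
    loss.backward()  # accumulates into .grad and frees this batch's graph
    loss_sum += loss.item()  # running total, for logging only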

If you divide by iter_size, you don’t need to change the learning rate. If you don’t, then you should divide the learning rate by iter_size to get the same performance. I am assuming that you are using SGD as the optimizer.
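To make the scaling argument concrete, here is a minimal sketch that compares the gradient of one full batch with the gradient accumulated over smaller batches. The toy linear model, the random data, and all sizes are assumptions made up for illustration; they are not from the code discussed in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: a linear model and random data, only for illustration.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()  # default reduction="mean"
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))
iter_size = 4  # 4 small batches of 2 samples each

# Reference: one backward pass over the full batch of 8.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulation: backward() on each small batch, loss divided by iter_size.
model.zero_grad()
for x, y in zip(inputs.chunk(iter_size), targets.chunk(iter_size)):
    (criterion(model(x), y) / iter_size).backward()
accum_grad = model.weight.grad.clone()

# Same loop without the division: the gradient is iter_size times larger.
model.zero_grad()
for x, y in zip(inputs.chunk(iter_size), targets.chunk(iter_size)):
    criterion(model(x), y).backward()
unscaled_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
print(torch.allclose(unscaled_grad, iter_size * full_grad, atol=1e-5))
```

With plain SGD, the unscaled variant combined with a learning rate divided by iter_size should take the same step, which matches the remark above. Note that the exact match relies on all small batches having the same size.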


@zhuyi490 Have you ever tested your code with minibatch_size=1?
I’ve tested my code with (iter_size=2, minibatch_size=2) and (iter_size=1, minibatch_size=4). However, when I set iter_size=4 and minibatch_size=1, the accuracy became pretty low.

Thanks for your reply, it helps. I think you are right.

No, I didn’t test it with mini-batch size 1. Actually, I never use a batch size of 1 because of unstable performance.

@Gopal_Sharma I can see why the two approaches are identical mathematically, but what is the difference computationally?

If I understand correctly, in the first case, every iteration extends the graph (in the loss = loss + criterion(...) line) but the backward() function is then only called once per minibatch, while in the second version, the graph is always the same, but backward() gets called on every example in the minibatch.

So which of the two solutions would be preferable and why? I am not sure I understand how much bigger the graph would need to get in the first version and which parts of the graph would need to be kept around until zero_grad is called again. But I suppose it depends on the relative cost of this versus calling backward()?

Sorry for the late reply. In my implementation, I am assuming that you want to fit old_mini_batch_size training instances in one batch, but because of the GPU memory constraint you can’t. So you divide old_mini_batch_size into iter_size smaller mini-batches such that:

old_mini_batch_size = iter_size x minibatch_size

In both the first and the second implementation, the training batch size is minibatch_size; I am exploring two ways to backpropagate the gradients. The first implementation does not accumulate gradients and keeps the entire graph in memory. The second implementation computes the gradient of one mini-batch (of size minibatch_size), accumulates it, and frees the memory. Keep in mind that

optimizer.zero_grad()

zeros all gradients, and when you do

loss.backward()

you are adding the newly computed gradients to the previous gradient values.

"If I understand correctly, in the first case, every iteration extends the graph (in the loss = loss + criterion(…) line) but the backward() function is then only called once per minibatch, while in the second version, the graph is always the same, but backward() gets called on every example in the minibatch."

Your understanding is wrong here.

for i in range(iter_size):

The iter_size is the number of mini-batches whose gradients you accumulate. Hence in my first formulation, where you keep adding up the loss, you need to keep iter_size x minibatch_size worth of data in GPU memory. Only when you call .backward() after the for loop are the buffers needed for the backward pass released.

But, in my second implementation,

for i in range(iter_size):
    loss = criterion(output, target_var) / iter_size
    loss.backward()  # backward pass after every small mini-batch
    loss_sum += loss.item()

I am calling backward() after every small mini-batch. This frees the mini-batch's buffers every time, so you consume little GPU memory.
Now, to answer your question: if you have limited GPU memory, you should use the second implementation. In the second approach, you can choose a mini-batch size anywhere from 1 up to whatever fits into your GPU in one forward-backward pass. Don’t forget to divide the loss by iter_size to normalize the gradient. The second version will give you the same result as if you had the larger mini-batch size. Some people have reported that performance differs depending on minibatch_size. It shouldn’t; if it does, there is probably an issue with how the gradients are normalized.
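Put together, the second approach can be sketched as a complete loop. This is only a sketch under assumptions: the linear model, the list standing in for a data loader, and the sizes are hypothetical placeholders for a real setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for a real model and data loader.
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
minibatch_size, iter_size = 2, 4  # effective batch size: 2 * 4 = 8
loader = [
    (torch.randn(minibatch_size, 10), torch.randint(0, 3, (minibatch_size,)))
    for _ in range(16)
]

num_updates = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(loader, start=1):
    loss = criterion(model(x), y) / iter_size  # scale so the sum acts as a mean
    loss.backward()  # adds to .grad and frees this small batch's graph
    if step % iter_size == 0:
        optimizer.step()       # one parameter update per iter_size small batches
        optimizer.zero_grad()  # reset the accumulated gradients
        num_updates += 1

print(num_updates)  # 16 small batches / iter_size of 4 = 4 updates
```

Only one small batch's graph is alive at any time, which is where the memory saving comes from. If the number of batches is not a multiple of iter_size, the leftover gradients at the end of the loop are never applied in this sketch.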


Thank you for your answer.
I noticed one problem: compared with training on more GPUs with the same total batch size, this accumulated gradient method cannot solve the batch normalization problem, right?

Yeah. Batch normalization is tricky to get right in a multi-GPU setting, mainly because BN requires computing the mini-batch mean and therefore needs information about tensors on other GPUs. Communication (sharing) between GPUs is costly.


Any idea on how to deal with BatchNorm2d when accumulating gradients?
It seems that BatchNorm2d updates the running mean and variance during the forward pass (see here).

I have a mini-batch of n samples. I forward one sample at a time (the loss function is divided by n). In this setup, I get worse performance than when I forward more than one sample at once (8, for instance). I expected to get the same result since the final accumulated gradient should be the same. I suspect that this has something to do with the BatchNorm2d layers in my model.

I use nn.CrossEntropyLoss(reduction='sum') as the loss, and I divide it by the size of the mini-batch (i.e., n) when called.
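The suspicion about BatchNorm2d can be checked in isolation. The following sketch, which assumes a single BatchNorm2d layer and a made-up random batch, shows that in training mode the running statistics change during the forward pass and that forwarding one sample at a time gives different outputs than forwarding the whole batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3)        # modules start in training mode
x = torch.randn(8, 3, 4, 4)   # hypothetical batch: 8 samples, 3 channels

before = bn.running_mean.clone()
out_batch = bn(x)             # normalized with the statistics of all 8 samples
after = bn.running_mean.clone()

# One sample at a time: each sample is normalized with its own statistics.
out_single = torch.cat([bn(x[i:i + 1]) for i in range(8)])

running_stats_changed = not torch.equal(before, after)
outputs_match = torch.allclose(out_batch, out_single, atol=1e-3)
print(running_stats_changed, outputs_match)
```

Because the normalization itself depends on the batch, the accumulated gradient of a model containing such layers is generally not the same as the full-batch gradient, even when the loss is scaled correctly.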

Thank you!

Don’t mass-tag people. This is a first warning.

Sorry. I removed them.

Your general approach is right, and I also assume that the BatchNorm layers might be a problem in this case.
If you have only very few samples in each forward pass, you could use InstanceNorm or GroupNorm instead, which should work better for small batch sizes.
Alternatively, you could also try to change the momentum of BatchNorm, but I’m not sure whether that will really help a lot.
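As a rough illustration of both suggestions, the sketch below checks that GroupNorm normalizes each sample independently of the rest of the batch, and shows where the BatchNorm momentum is set. The layer sizes, the group count, and the momentum value are arbitrary assumptions, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 4, 5, 5)  # hypothetical batch: 8 samples, 4 channels

# GroupNorm computes statistics per sample, so batch size does not matter.
gn = nn.GroupNorm(num_groups=2, num_channels=4)
out_batch = gn(x)
out_single = torch.cat([gn(x[i:i + 1]) for i in range(8)])
groupnorm_matches = torch.allclose(out_batch, out_single, atol=1e-5)
print(groupnorm_matches)

# BatchNorm with a smaller momentum: the running statistics move more slowly,
# so noisy per-sample statistics have less influence on them.
# The value 0.01 is only an example (the PyTorch default is 0.1).
bn = nn.BatchNorm2d(4, momentum=0.01)
print(bn.momentum)
```

Note that a smaller momentum only affects the running statistics used in eval mode; the training-mode normalization still uses the statistics of whatever batch is forwarded.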

Do you only need to divide by iter_size if the loss function takes the average?
Let’s say I’m using a sum of squared errors; should I call backward() without dividing?
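One way to check this is to compare accumulated and full-batch gradients for both reductions. This is a sketch on a made-up linear regression problem; the helper names accumulated_grad and full_grad are hypothetical and only exist in this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy regression problem.
model = nn.Linear(4, 1)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
iter_size = 4

def accumulated_grad(criterion, scale):
    """Accumulate gradients over iter_size small batches, scaling each loss."""
    model.zero_grad()
    for x, y in zip(inputs.chunk(iter_size), targets.chunk(iter_size)):
        (criterion(model(x), y) * scale).backward()
    return model.weight.grad.clone()

def full_grad(criterion):
    """Gradient of the loss over the full batch in one backward pass."""
    model.zero_grad()
    criterion(model(inputs), targets).backward()
    return model.weight.grad.clone()

sse = nn.MSELoss(reduction="sum")   # sum of squared errors
mse = nn.MSELoss(reduction="mean")  # mean squared error

# Summed loss: no division needed to match the full-batch sum.
sum_matches = torch.allclose(accumulated_grad(sse, 1.0), full_grad(sse), atol=1e-4)
# Averaged loss: divide by iter_size to match the full-batch mean.
mean_matches = torch.allclose(
    accumulated_grad(mse, 1.0 / iter_size), full_grad(mse), atol=1e-4
)
print(sum_matches, mean_matches)
```

So for a summed loss, the accumulated gradient already equals the gradient of the sum over the full batch. Whether that is the desired scale is a separate question: the gradient of a summed loss grows with the effective batch size, so the learning rate may need to be adjusted accordingly.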

So the uncertainty about how to handle batch norm with accumulated gradients still remains?
I didn’t find any blog post with a solution, or any confirmation that we can benefit from gradient accumulation without adjusting the batchnorm stats.

I’m not aware of any blog post and would recommend looking at other implementations that successfully use gradient accumulation, such as NVIDIA’s DeepLearningExamples.
Based on a quick search, it seems that Bert, Jasper, FastPitch, MaskRCNN, Transformer, TransformerXL, and NCF have a flag to set the gradient accumulation steps. You could take a look at some of these models and check whether the batchnorm layers (especially the momentum) are changed or whether batchnorm is just not used.