Hi, I was wondering how I can accumulate gradients during gradient descent in PyTorch (i.e. iter_size in a Caffe prototxt), since a single GPU can’t hold very large models now. I know this was already discussed here, but I just want to confirm my code is correct. Thank you very much. I attach my code s…

Hi zhuyi490, Sorry, this is not an answer, but I have the same question. If you have figured out the correct way, could you share it? In particular, it is not clear to me whether we should or shouldn’t divide the loss by iter_size. I thought you used loss_mini_batch just for logging. Thanks.

Hi Mori, Sorry, I didn’t figure it out either. I just used the code snippet above to train my model. The model converges, but the accuracy is 2% lower than with my Caffe code. I don’t know whether this is a framework difference or something in my implementation is wrong. Maybe we wait for their …

This code looks good. The lower accuracy must be due to some other subtle reason.

Hi smth, So, in the end, is there no need to divide the loss by iter_size? I’m still a bit confused, since apaszke mentioned dividing here. PyTorch Gradients Normally when we’re doing backprop we would do the following: loss.backward() # This calculates…

Dividing the loss by iter_size might be the subtle reason :) I just meant that the code didn’t have any glaring errors.

@MORI @smth Yeah, that may be the subtle reason. Thanks for pointing it out. But actually, I don’t really understand. From my perspective, the loss is calculated for each mini-batch of samples. The gradients are accumulated if we don’t reset them, but the loss is not accumulated. Is my understanding correct…
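One way to see why the division matters is to write the full-batch loss in terms of the mini-batch losses. The sketch below assumes the loss is averaged over the samples in each mini-batch (the default reduction for most PyTorch loss functions) and that all mini-batches have the same size. The symbols are notation introduced here, not from the thread: k is iter_size, m is the mini-batch size, B_j is the j-th mini-batch, and \ell_i is the loss of sample i.

```latex
\begin{aligned}
L_{\text{full}}(\theta)
  &= \frac{1}{km} \sum_{i=1}^{km} \ell_i(\theta)
   = \frac{1}{k} \sum_{j=1}^{k} L_j(\theta),
  \qquad
  L_j(\theta) = \frac{1}{m} \sum_{i \in B_j} \ell_i(\theta), \\
\nabla_\theta L_{\text{full}}(\theta)
  &= \sum_{j=1}^{k} \nabla_\theta \frac{L_j(\theta)}{k}.
\end{aligned}
```

Each call to backward() adds its result into .grad, so calling it k times on L_j / k accumulates the full-batch gradient. Without the division, the accumulated gradient is k times larger than the full-batch one.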

Note that, keeping the learning rate constant, it is important to feed the optimizer the same gradients before and after using this trick. If we don’t use the accumulation trick, we would be computing the gradient like this: "blah-blah" optimizer.zero_grad() loss = 0 minibatch_size = old_batch_size …
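The point about feeding the optimizer the same gradients can be checked numerically. The following is a minimal sketch, not the code from this thread: the toy model, the data, and the helper names full_batch_grads and accumulated_grads are all hypothetical, and it assumes a mean-reduced loss and equally sized mini-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: a small linear model and one "large" batch of 4 samples.
model = nn.Linear(3, 1)
criterion = nn.MSELoss()  # default reduction is "mean"
inputs = torch.randn(4, 3)
targets = torch.randn(4, 1)


def full_batch_grads():
    """Gradients from a single backward pass over the whole batch."""
    model.zero_grad()
    criterion(model(inputs), targets).backward()
    return [p.grad.clone() for p in model.parameters()]


def accumulated_grads(iter_size):
    """Gradients accumulated over iter_size equally sized mini-batches."""
    model.zero_grad()
    for x, y in zip(inputs.chunk(iter_size), targets.chunk(iter_size)):
        loss = criterion(model(x), y) / iter_size  # scale so the sum is a mean
        loss.backward()  # adds into p.grad instead of overwriting it
    return [p.grad.clone() for p in model.parameters()]


reference = full_batch_grads()
for iter_size in (1, 2, 4):
    grads = accumulated_grads(iter_size)
    same = all(torch.allclose(g, r, atol=1e-6) for g, r in zip(grads, reference))
    print(f"iter_size={iter_size}: matches full batch -> {same}")
```

In a training loop, optimizer.step() and optimizer.zero_grad() would then be called once per iter_size mini-batches rather than after every backward().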

@zhuyi490 Have you ever tested your code with minibatch_size=1? I’ve tested my code with (iter_size=2, minibatch_size=2) and (iter_size=1, minibatch_size=4). However, when I set iter_size=4 and minibatch_size=1, the accuracy became pretty low.
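Whether this applies to the model above is not stated in the thread, but one known case where accumulation is not equivalent to a larger batch is a model with batch-dependent layers such as BatchNorm, which in training mode normalizes with the statistics of the current mini-batch. The sketch below uses a hypothetical toy model and data to compare one batch of 4 against two accumulated mini-batches of 2; it is an illustration of that effect, not a diagnosis of the result reported here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model containing a batch-dependent layer.
model = nn.Sequential(nn.Linear(3, 4), nn.BatchNorm1d(4), nn.Linear(4, 1))
model.train()  # BatchNorm normalizes with per-mini-batch statistics
criterion = nn.MSELoss()
inputs = torch.randn(4, 3)
targets = torch.randn(4, 1)


def grads(iter_size):
    """Flattened gradients accumulated over iter_size mini-batches."""
    model.zero_grad()
    for x, y in zip(inputs.chunk(iter_size), targets.chunk(iter_size)):
        (criterion(model(x), y) / iter_size).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])


full = grads(iter_size=1)   # one batch of 4 samples
accum = grads(iter_size=2)  # two accumulated mini-batches of 2 samples
max_diff = (full - accum).abs().max().item()
print("largest gradient difference:", max_diff)
```

The difference is nonzero even though the loss is divided by iter_size, because the normalization itself changes with the mini-batch composition.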

Thanks for your reply, it helps. I think you are right.

No, I didn’t test it with a mini-batch size of 1. Actually, I never used a batch size of 1 because of unstable performance.

How to implement accumulated gradient?


MORI June 22, 2017, 4:10am 5

Hi smth,

So, in the end, is there no need to divide the loss by iter_size?
I’m still a bit confused, since apaszke mentioned dividing here.

Thank you for your help.
