How to implement gradient accumulation?

Note that, if the learning rate is kept constant, it is important that the optimizer sees the same gradients before and after applying this trick. Without accumulation, we would compute the gradient like this:

"blah-blah"
optimizer.zero_grad()
loss = 0
minibatch_size = old_batch_size // iter_size      # each minibatch is 1/iter_size of the original batch
# input_var and target_var are assumed to hold the iter_size minibatches
for i in range(iter_size):
    output = model(input_var[i])                   # forward pass on the i-th minibatch
    loss += criterion(output, target_var[i])
loss = loss / iter_size
loss.backward()
optimizer.step()

But when we use this trick, we need to make sure that the mean of the accumulated gradients is the same as before.
So we divide the loss by iter_size each time, so that after summing, the gradients come out the same.

optimizer.zero_grad()
loss_sum = 0
for i in range(iter_size):
    output = model(input_var[i])                         # forward pass on the i-th minibatch
    loss = criterion(output, target_var[i]) / iter_size  # scale so the summed gradients match the large batch
    loss.backward()                                      # gradients accumulate in the .grad buffers
    loss_sum += loss.item()                              # keep a plain float for logging, not the graph
optimizer.step()
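
For concreteness, here is a minimal, self-contained sketch of how this accumulation pattern fits into an ordinary training loop. The toy model, data, and hyperparameters below are hypothetical placeholders, not part of the original post:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy setup, only to make the sketch runnable
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
iter_size = 4                                # number of small minibatches to accumulate

data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=4)      # small minibatches of size old_batch_size / iter_size

optimizer.zero_grad()
for i, (input_var, target_var) in enumerate(loader):
    output = model(input_var)
    loss = criterion(output, target_var) / iter_size   # scale so the summed gradient matches the large batch
    loss.backward()                                    # gradients accumulate in the .grad buffers
    if (i + 1) % iter_size == 0:                       # step once every iter_size minibatches
        optimizer.step()
        optimizer.zero_grad()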

If you divide by iter_size, you don't need to change the learning rate. If you don't, you should divide the learning rate by iter_size to get the same behaviour, as in the sketch below. I am assuming you are using SGD as the optimizer.
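
For the second option, a minimal sketch (reusing the hypothetical model, criterion, loader, and iter_size from the sketch above) that scales the learning rate instead of the loss:

# Variant: keep the unscaled loss and shrink the SGD learning rate instead.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 / iter_size)

optimizer.zero_grad()
for i, (input_var, target_var) in enumerate(loader):
    output = model(input_var)
    loss = criterion(output, target_var)     # note: no division by iter_size here
    loss.backward()                          # accumulated gradient is iter_size times larger
    if (i + 1) % iter_size == 0:
        optimizer.step()                     # the smaller learning rate compensates
        optimizer.zero_grad()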
