How to implement gradient accumulation?

Sorry for the late reply. In my implementation, I am assuming that you want to fit old_mini_batch_size training instances in one update, but because of the GPU memory constraint you can’t. So you divide this old_mini_batch_size into iter_size smaller mini-batches such that:

old_mini_batch_size = iter_size x mini_batch_size

In both implementations the training batch size is mini_batch_size, and I am exploring two ways you can back-propagate the gradients. The first implementation doesn’t accumulate the gradients and keeps the entire graph in memory, whereas the second implementation computes the gradients of each mini-batch (of size mini_batch_size), accumulates them, and frees the memory. Keep in mind that

optimizer.zero_grad()

zeros all gradients and when you do:

loss.backward()

you are adding the newly computed gradients to the previous gradient values.
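
As a quick illustration of that accumulation behaviour, here is a minimal sketch (the tensor and its size are just made up for the example):

import torch

w = torch.ones(3, requires_grad=True)

w.sum().backward()
print(w.grad)              # tensor([1., 1., 1.])

(2 * w.sum()).backward()   # gradients are added to the existing ones
print(w.grad)              # tensor([3., 3., 3.])

w.grad.zero_()             # this is what optimizer.zero_grad() does for each parameter
print(w.grad)              # tensor([0., 0., 0.])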

> If I understand correctly, in the first case, every iteration extends the graph (in the loss = loss + criterion(…) line) but the backward() function is then only called once per minibatch, while in the second version, the graph is always the same, but backward() gets called on every example in the minibatch.

Your understanding is wrong here.

for i in range(iter_size):

The iter_size is the number of times you accumulate the gradients of a mini-batch. So in my first formulation you keep adding the loss, which implies that you need to keep iter_size x mini_batch_size worth of data in GPU memory, and only when you call .backward() after the for loop do you release all the buffers that were kept for the backward pass.
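
For reference, the first formulation looks roughly like this (a sketch; model, criterion, optimizer and the data_iter loader are placeholder names I’m assuming here):

optimizer.zero_grad()
loss = 0
for i in range(iter_size):
    input_var, target_var = next(data_iter)                    # small mini-batch
    output = model(input_var)
    loss = loss + criterion(output, target_var) / iter_size    # the graph keeps growing
loss.backward()     # single backward pass over the whole accumulated graph
optimizer.step()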

But, in my second implementation,

for i in range(iter_size):
    output = model(input_var)                          # forward pass on the small mini-batch
    loss = criterion(output, target_var) / iter_size   # normalize so the gradient matches the large batch
    loss.backward()                                    # accumulate gradients and free the graph
    loss_sum += loss.item()                            # keep only the scalar value

I am doing backward() after every small mini-batch, which frees the graph for that mini-batch every time, so you consume much less GPU memory.
Now, answering your question: if you have limited GPU memory, you should use the second implementation. In the second approach you can choose any mini-batch size, from 1 up to whatever fits into your GPU in one forward-backward pass. Don’t forget to divide the loss by iter_size to normalize the gradient. The second version will give you the same result as if you were using the larger mini-batch size. Some people have reported that performance differs depending on mini_batch_size; it shouldn’t, and if it does there is probably a gradient-normalization issue.
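
Putting the second approach together into one full update step, it could look like this (again a sketch; data_iter, model, criterion and optimizer are assumed names):

optimizer.zero_grad()                                   # clear gradients once per large batch
loss_sum = 0.0
for i in range(iter_size):
    input_var, target_var = next(data_iter)             # small mini-batch that fits in GPU memory
    output = model(input_var)
    loss = criterion(output, target_var) / iter_size    # normalize the gradient
    loss.backward()                                     # accumulate gradients, the graph is freed here
    loss_sum += loss.item()
optimizer.step()                                        # one update, equivalent to the large mini-batch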
