How are losses aggregated over multiple computed batches?

According to this and this post, it is possible to simulate a large SGD batch size by running multiple smaller batches, computing the loss and calling loss.backward() on each of them, and only calling optimizer.step() at the end - see below the quoted code from the original post.

The question concerns the optimization step: the big batch consists of many small batches computed one by one, and each small batch has its gradient computed separately. So before the big batch's optimization step is applied, there are many gradient vectors to aggregate into a single gradient vector. How are they aggregated (average or sum)? This detail is important to know in order to choose the learning rate.

Update: If I understand correctly, according to experiments I have conducted (and answers by other users to this question), the gradients are aggregated by summing. This means the code sample below could be somewhat misleading - users may want to change their code so that each batch's loss is multiplied by its relative size. Since the example below uses 10 "small batches", the loss should typically be multiplied by 0.1 for each small batch: loss = crit(pred, target) / 10.0. Otherwise, the learning rate should be adjusted to account for the fact that the accumulated gradient is the sum of 10 gradients, which is equivalent to using a 10x higher learning rate.
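A minimal, self-contained sketch of this gradient-accumulation pattern with the suggested loss scaling (the model, data, and batch sizes here are hypothetical, just for illustration):

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny setup: a linear model and random data.
model = torch.nn.Linear(4, 1)
crit = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(100, 4)
target = torch.randn(100, 1)

n_batches = 10  # simulate one big batch of 100 as 10 small batches of 10

optimizer.zero_grad()
for i in range(n_batches):
    x = data[i * 10:(i + 1) * 10]
    y = target[i * 10:(i + 1) * 10]
    pred = model(x)
    # Scale each small batch's loss by its relative size, so the summed
    # gradients match the gradient of one big batch with a mean loss.
    loss = crit(pred, y) / n_batches
    loss.backward()  # gradients accumulate (sum) into each param's .grad
optimizer.step()     # a single update for the whole "big batch"
```

With mean-reduction losses like MSELoss, dividing each small batch's loss by the number of small batches makes the accumulated gradient equal to the gradient of one big batch, so the learning rate needs no adjustment.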

You can do a simple litmus test by trying a computation with known constant gradients (e.g., addition).

>>> import torch
>>> a = torch.ones(3,3, requires_grad=True)
>>> a.grad
>>> a.sum().backward()
>>> a.grad
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
>>> a.sum().backward()
>>> a.grad
tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])

It sums the gradients: each backward() call accumulates into .grad rather than overwriting it.
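Extending the same litmus test, scaling each loss by the relative batch size (0.1 for 10 small batches) makes the accumulated sum equal the average gradient - a quick sketch:

```python
import torch

a = torch.ones(3, 3, requires_grad=True)

# Accumulate 10 "small batch" backward passes, scaling each loss by 1/10.
for _ in range(10):
    (a.sum() / 10.0).backward()

# Each pass contributes 0.1 per element; ten passes sum back to 1.0,
# the same gradient as a single unscaled backward().
print(a.grad)
```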
