How are losses aggregated over multiple computed batches?

According to this and this post, it is possible to simulate a large SGD batch size by running multiple smaller batches, computing the loss and calling loss.backward() on each of them, and only calling optimizer.step() at the end - a sketch of the pattern is shown below, after the update.

The question is about the optimization step: the big batch consists of many small batches that are computed one by one, and each small batch has its gradient computed separately. Before the optimization step for the big batch is applied, these many gradient vectors have to be aggregated into a single gradient vector. How are they aggregated (average or sum)? This is an important detail to know in order to choose the learning rate.

Update: If I understand correctly, according to experiments I have conducted (and the answers from other users to this question), the gradients are aggregated by summing. This means that the code sample below could be somewhat misleading - users may want to change their code so that each small batch's loss is multiplied by its relative size. Since the example below uses 10 "small batches", the loss should typically be multiplied by 0.1 for each small batch: loss = crit(pred, target) * 1/10.0. Otherwise, the learning rate should be adjusted to account for the fact that the gradient is the sum of 10 gradients, which is equivalent to using a 10x higher learning rate.
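
For reference, here is a minimal sketch of the accumulation pattern (not the original quoted code; the toy model, crit, optim, and the split into 10 random small batches are assumed purely for illustration):

import torch

# Assumed setup for illustration: a toy linear model, MSE loss, SGD, and
# 10 random "small batches" that together form one big batch.
model = torch.nn.Linear(4, 1)
crit = torch.nn.MSELoss()
optim = torch.optim.SGD(model.parameters(), lr=0.1)
small_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]

optim.zero_grad()                        # clear previously accumulated gradients
for inputs, target in small_batches:
    pred = model(inputs)
    loss = crit(pred, target) * 1/10.0   # scale by this batch's relative size
    loss.backward()                      # each backward() sums into param.grad
optim.step()                             # a single step for the whole big batch

Without the * 1/10.0 factor, the accumulated gradient is the sum over the 10 small batches, i.e. 10x the average gradient of the big batch.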

You can do a simple litmus test for this by trying a computation with known constant gradients (e.g., a sum, whose gradient is 1 for every element).

>>> import torch
>>> a = torch.ones(3,3, requires_grad=True)
>>> a.grad
>>> a.sum().backward()
>>> a.grad
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
>>> a.sum().backward()
>>> a.grad
tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
>>>

Each backward() call sums the new gradients into the existing .grad, so what you end up with is the sum of the gradients over all backward() calls.
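
Continuing the same session (a sketch, assuming the same tensor a): zeroing the gradient in place, which is effectively what optimizer.zero_grad() does for each parameter, resets the accumulation, so the next backward() starts summing from zero again.

>>> a.grad.zero_()
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
>>> a.sum().backward()
>>> a.grad
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])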
