# How are losses aggregated over multiple computed batches?

According to this and this post, it is possible to simulate a large SGD batch size by running multiple smaller batches, computing the loss and calling `loss.backward()` on each of them, and only then calling `optimizer.step()` - see the code quoted from the original post below.

The question is about the optimization step: the big batch consists of many small batches that are computed one by one, and each small batch has its gradient computed separately. So before the optimization step of the big batch is applied, there are many gradient vectors that must be aggregated into a single gradient vector. How are they aggregated - by averaging or by summing? This detail is important to know in order to choose the learning rate.

Update: If I understand correctly, according to experiments I have conducted (and answers by other users to this question), the gradients are aggregated by summation. This means that the code sample below could be somewhat misleading - users may want to change their code so that each batch's loss is multiplied by its relative size. Since the example below uses 10 "small batches", the loss should typically be multiplied by `0.1` for each small batch: `loss = crit(pred, target) * 1/10.0`. Otherwise, the learning rate should be adjusted to account for the gradient being the sum of 10 gradients, which is equivalent to using a 10x higher learning rate.
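To make the scaling concrete, here is a minimal sketch of gradient accumulation with the per-batch loss divided by the number of small batches. The model, loss, optimizer, and batch shapes are hypothetical placeholders, not the original post's code:

```python
import torch

# Hypothetical tiny model, loss, and optimizer for illustration.
model = torch.nn.Linear(4, 1)
crit = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

n_small_batches = 10  # the "big batch" is split into 10 small batches

optimizer.zero_grad()
for _ in range(n_small_batches):
    x = torch.randn(8, 4)        # one small batch of inputs
    target = torch.randn(8, 1)
    pred = model(x)
    # Divide the loss by the number of small batches so that the
    # summed gradients average out to one big-batch gradient.
    loss = crit(pred, target) / n_small_batches
    loss.backward()              # gradients accumulate (sum) in .grad
optimizer.step()                 # a single update for the whole big batch
```

Without the division, each `loss.backward()` call would still work, but the accumulated gradient would be the sum of 10 per-batch gradients rather than their average.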

You can do a simple litmus test for this by trying some computation with known constant grads (e.g., addition).

```
>>> import torch
>>> a = torch.ones(3, 3, requires_grad=True)
>>> a.sum().backward()
>>> a.grad
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
>>> a.sum().backward()
>>> a.grad
tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
```

After the second `backward()` call, `a.grad` holds twice the constant gradient - gradients are summed into `.grad`, not averaged.