There have been a couple of general discussions about accumulating gradients before (How to implement accumulated gradient? and How to implement accumulated gradient in pytorch (i.e. iter_size in caffe prototxt)).

But I want to know how the optimizer's momentum behaves when we’re training a model like this:

```
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# suppose we want to accumulate gradients for 'N' steps:
for i, input in enumerate(data_loader):
    output = model(input)
    loss = criterion(output, input) / N  # scale so the accumulated grads average out
    loss.backward()
    if (i + 1) % N == 0:  # step once every N mini-batches
        optimizer.step()
        optimizer.zero_grad()
```

Are the momentum buffers updated at each call of `loss.backward()`, or (desirably) only when `optimizer.step()` is called?
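One way I thought of checking this empirically — a minimal sketch with a hypothetical toy model, inspecting `optimizer.state`, which SGD populates lazily:

```python
import torch

# Toy setup (hypothetical): a tiny linear model with momentum SGD.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()

p = next(model.parameters())
# After backward(), gradients have accumulated into p.grad,
# but the optimizer's per-parameter state is still empty.
state_entries_before_step = len(optimizer.state[p])

optimizer.step()
# Only step() creates/updates the momentum buffer.
has_momentum_buffer = 'momentum_buffer' in optimizer.state[p]

print(state_entries_before_step, has_momentum_buffer)
```

If this is right, `loss.backward()` only adds into `.grad`, and the momentum buffer is touched once per `optimizer.step()` — but I'd like confirmation.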