Loss momentum behavior when accumulating gradients

There have been a couple of general discussions about accumulating gradients before (How to implement accumulated gradient? and How to implement accumulated gradient in pytorch (i.e. iter_size in caffe prototxt)).

But I want to know how loss momentum behaves when we’re training a model like this:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# suppose we want to accumulate gradients for 'N' steps:
for i, input in enumerate(data_loader):
    output = model(input)
    loss = criterion(output, input) / N
    loss.backward()                # gradients accumulate in the parameters' .grad fields

    if (i + 1) % N == 0:           # take an optimizer step once every N iterations
        optimizer.step()
        optimizer.zero_grad()

Are the loss momentum values overwritten at each call of loss.backward(), or (desirably) only when optimizer.step() is called?

Hi Ali,
The loss variable is reassigned every iteration by the line:

loss = criterion(output, input) / N

The momentum buffers are maintained by the optimizer for the parameters/tensors it is tracking, in this case model.parameters(). They are only updated when optimizer.step() is called; loss.backward() just accumulates gradients into each parameter's .grad field and never touches the momentum.
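
If you want to see this directly, here is a minimal sketch (the Linear model, MSELoss, and N = 4 are made up for illustration) that inspects SGD's momentum_buffer around the backward() and step() calls:

import torch

N = 4
model = torch.nn.Linear(10, 10)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

param = next(model.parameters())

def momentum_buffer():
    # SGD keeps its momentum in optimizer.state[param]['momentum_buffer'],
    # which only appears after the first call to optimizer.step().
    state = optimizer.state.get(param, {})
    buf = state.get('momentum_buffer')
    return None if buf is None else buf.clone()

for i in range(2 * N):
    input = torch.randn(8, 10)
    output = model(input)
    loss = criterion(output, input) / N

    before = momentum_buffer()
    loss.backward()                      # only .grad changes here
    after_backward = momentum_buffer()

    # backward() never modifies the momentum buffer
    assert (before is None and after_backward is None) or torch.equal(before, after_backward)

    if (i + 1) % N == 0:
        optimizer.step()                 # momentum buffer is created/updated here
        optimizer.zero_grad()
        print(f"iter {i}: momentum buffer norm after step:", momentum_buffer().norm().item())

The assert never fails, and the momentum buffer only exists (and changes) after optimizer.step() runs.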
Hope this helps.