There have been a couple of general discussions about accumulating gradients before (How to implement accumulated gradient? and How to implement accumulated gradient in pytorch (i.e. iter_size in caffe prototxt)).

But I want to know how the optimizer's momentum behaves when we’re training a model like this:

```
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# suppose we want to accumulate gradients for 'N' steps:
for i, input in enumerate(data_loader):
    output = model(input)
    loss = criterion(output, input) / N  # scale so the accumulated grads average out
    loss.backward()
    if (i + 1) % N == 0:  # step once every N mini-batches
        optimizer.step()
        optimizer.zero_grad()
```

Are the momentum buffers updated at each call of `loss.backward()`, or (desirably) only when `optimizer.step()` is called?
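One way I thought of checking this empirically — a minimal sketch with a hypothetical toy model, inspecting `optimizer.state`, which SGD populates lazily:

```python
import torch

# Toy setup (hypothetical): a tiny linear model with momentum SGD.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()

p = next(model.parameters())
# After backward(), gradients have accumulated into p.grad,
# but the optimizer's per-parameter state is still empty.
state_entries_before_step = len(optimizer.state[p])

optimizer.step()
# Only step() creates/updates the momentum buffer.
has_momentum_buffer = 'momentum_buffer' in optimizer.state[p]

print(state_entries_before_step, has_momentum_buffer)
```

If this is right, `loss.backward()` only adds into `.grad`, and the momentum buffer is touched once per `optimizer.step()` — but I'd like confirmation.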