There have been a couple of general discussions about accumulating gradients before (How to implement accumulated gradient? and How to implement accumulated gradient in pytorch (i.e. iter_size in caffe prototxt)).
But I want to know how the optimizer's momentum behaves when we train a model like this:
```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# suppose we want to accumulate gradients for 'N' steps:
for i, input in enumerate(data_loader):
    output = model(input)
    loss = criterion(output, input) / N  # scale the loss so the accumulated gradient matches a full batch
    loss.backward()                      # gradients accumulate in the parameters' .grad
    if i % N == 0 and i > 0:
        optimizer.step()
        optimizer.zero_grad()
```
Are the momentum buffers updated at each call of loss.backward(), or (desirably) only when optimizer.step() is called?
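In case it helps frame the question, here is a minimal sketch (toy model and made-up tensors, not my actual setup; `momentum_buffer` is just a hypothetical helper) of how one could inspect when SGD's momentum buffer changes, since SGD keeps it in optimizer.state:

```python
import torch
import torch.nn as nn

# Toy setup purely for inspection.
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

def momentum_buffer():
    # SGD stores per-parameter state (including the momentum buffer) in optimizer.state.
    state = optimizer.state[model.weight]
    return state.get("momentum_buffer", None)

loss = criterion(model(x), y)
loss.backward()
print("after backward:", momentum_buffer())  # backward only fills .grad, no buffer yet

optimizer.step()
print("after step:", momentum_buffer())      # buffer created/updated by the step

optimizer.zero_grad()
```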