Hello. I want to follow the general (simplified) pseudocode structure for training my model:
# below is all simplified code for one epoch
model.zero_grad()
for sample, target in dataset:  # note: just one sample at a time
    out = model(sample)
    loss = criterion(out, target)
    loss.backward()  # accumulate gradients across the whole dataset

# update once per epoch; I turned on no_grad here to be safe
with torch.no_grad():
    for param in model.parameters():
        param.copy_(param - lr * param.grad / total_samples)
However, I noticed that the model always collapses to predicting the same value for everything. I have confirmed that it can learn something useful if I increase the batch size and manually perform the update after every backward() call, but in this experiment I am trying to accumulate the gradients across every sample before updating.
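In case it helps, here is a self-contained toy version of the scheme I'm describing. The data, dimensions, learning rate, and epoch count are all made up for the repro; only the accumulate-then-update structure matches my real setup:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy linear-regression data (placeholder for my real dataset):
# y = X @ w_true + 0.1, so an exact linear fit exists.
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.1

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
lr = 0.1
total_samples = X.shape[0]

for epoch in range(100):
    model.zero_grad()
    for i in range(total_samples):
        out = model(X[i:i+1])            # one sample at a time
        loss = criterion(out, y[i:i+1])
        loss.backward()                  # sums per-sample gradients into .grad
    with torch.no_grad():                # single update per epoch
        for param in model.parameters():
            # divide by total_samples so the summed grads become an average
            param.copy_(param - lr * param.grad / total_samples)

final_loss = criterion(model(X), y).item()
print(final_loss)

On this toy problem the accumulated full-dataset update does converge for me, which is part of why I'm confused about the collapse in my real setup.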
Is this not a valid approach? Am I doing something incorrect on the PyTorch side of things? Thanks!