Hello. I want to follow the general (simplified) pseudocode structure for training my model:
```python
# below is all simplified code for one epoch
model.zero_grad()
for sample, target in dataset:  # note: just one sample at a time
    out = model(sample)
    loss = criterion(out, target)
    loss.backward()  # accumulate gradients across the whole dataset

# single parameter update at the end of the epoch
with torch.no_grad():  # I turned on no_grad here to be safe
    for param in model.parameters():
        param.copy_(param - lr * param.grad / total_samples)
```
However, I noticed that the model always collapses to predicting the same value for everything. I have confirmed that the model can learn something useful if I increase the batch size and manually perform an update after every backward call, but I am running an experiment where I accumulate the gradients across every sample before making a single update.
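For reference, here is a minimal sketch of the variant that does learn for me (names like `loader` and `lr` are placeholders for my actual setup; `loader` is assumed to be a DataLoader with batch_size > 1):

```python
# sketch of the working variant: one update after every backward call
for sample_batch, target_batch in loader:  # batches, not single samples
    model.zero_grad()
    out = model(sample_batch)
    loss = criterion(out, target_batch)
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param.copy_(param - lr * param.grad)
```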
Is this not a valid approach? Am I doing something incorrectly on the PyTorch side of things? Thanks!