Hello. I want to follow the general (simplified) pseudocode structure for training my model:
# below is all simplified code for one epoch
model.zero_grad()
for sample, target in dataset:  # note: just one sample at a time
    out = model(sample)
    loss = criterion(out, target)
    loss.backward()  # accumulate gradients across the whole dataset

# update once per epoch; I turned on no_grad here to be safe
with torch.no_grad():
    for param in model.parameters():
        param.copy_(param - lr * param.grad / total_samples)
However, I noticed that the model always collapses to predicting the same value for everything. I have confirmed that it can learn something useful if I increase the batch size and manually perform the update after every backward() call, but in this experiment I am trying to accumulate the gradients across every sample before updating.
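In case it helps, here is a self-contained toy version of the scheme I'm describing. The data, dimensions, learning rate, and epoch count are all made up for the repro; only the accumulate-then-update structure matches my real setup:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy linear-regression data (placeholder for my real dataset):
# y = X @ w_true + 0.1, so an exact linear fit exists.
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.1

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
lr = 0.1
total_samples = X.shape[0]

for epoch in range(100):
    model.zero_grad()
    for i in range(total_samples):
        out = model(X[i:i+1])            # one sample at a time
        loss = criterion(out, y[i:i+1])
        loss.backward()                  # sums per-sample gradients into .grad
    with torch.no_grad():                # single update per epoch
        for param in model.parameters():
            # divide by total_samples so the summed grads become an average
            param.copy_(param - lr * param.grad / total_samples)

final_loss = criterion(model(X), y).item()
print(final_loss)

On this toy problem the accumulated full-dataset update does converge for me, which is part of why I'm confused about the collapse in my real setup.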
Is this not a valid approach? Am I doing something incorrect on the PyTorch side of things? Thanks!