Similar to this post, I would like to do multiple forward passes before the backward pass, but on the same model, effectively simulating a very large batch size (as in full-batch gradient descent) that would not fit on any GPU.
More concretely: is it possible to somehow accumulate the gradient information and perform GD on a large dataset using PyTorch, and if so, how?
Suppose you can only forward pass 6 data points (6 images) at a time, but you want to perform GD on 36 data points. Let the dataloader load inputs of shape [6, 3, 224, 224]; then:
for i, (input, target) in enumerate(dataloader):
    output = model(input)
    loss = lossfn(output, target) / 6  # scale before backward so the accumulated gradient equals the mean over all 36 samples
    loss.backward()                    # accumulates gradients into each parameter's .grad; no weight update yet
    if (i + 1) % 6 == 0:
        optim.step()       # apply the accumulated gradients once every 6 mini-batches
        optim.zero_grad()  # reset the .grad buffers before the next accumulation round
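Note that the loss must be divided by the number of accumulation steps before calling backward(); scaling it afterwards has no effect on gradients that are already stored. For completeness, here is a minimal self-contained sketch of the same pattern; the model, loss, optimizer, and random tensors are placeholders chosen only so the snippet runs on its own (any real model and DataLoader would slot in the same way):

import torch
import torch.nn as nn

# Placeholder setup for illustration: a tiny model and 36 fake
# images/labels, processed as 6 mini-batches of 6 images each.
accum_steps = 6
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
lossfn = nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.01)

dataset = torch.utils.data.TensorDataset(
    torch.randn(36, 3, 224, 224),   # 36 fake images
    torch.randint(0, 10, (36,)),    # 36 fake class labels
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=6)

optim.zero_grad()
for i, (input, target) in enumerate(dataloader):
    output = model(input)
    # Scale before backward so the accumulated gradient matches
    # a single true batch of 36 samples.
    loss = lossfn(output, target) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optim.step()
        optim.zero_grad()

The only trade-off versus a true batch of 36 is memory for the activations: each backward() frees its own graph, so peak memory stays at the 6-image level while the gradient is mathematically the same mean over 36 samples (up to batch-statistics layers such as BatchNorm, which still see only 6 samples at a time).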