and my own mini-batch implementation like below (because each data sample is huge):
for i in range(len(batch_list)):
    output = network(batch_list[i])            # non-scalar output
    grad = grad_fn(output, teacher[i])         # manually computed, non-scalar grad
    torch.autograd.backward([output], [grad])  # non-scalar backward
    optimizer.step()
    optimizer.zero_grad()
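For reference, here is a self-contained version of the same pattern that actually runs. The dummy network, gradient function, and data below are placeholders I made up just so the snippet executes; the real ones are much larger:

import torch

# Placeholder pieces, invented only to make the snippet runnable.
network = torch.nn.Linear(4, 3)
optimizer = torch.optim.SGD(network.parameters(), lr=0.1)
batch_list = [torch.randn(2, 4) for _ in range(5)]   # each element is one batch
teacher = [torch.randn(2, 3) for _ in range(5)]      # matching targets

def grad_fn(output, target):
    # gradient of 0.5 * ||output - target||^2 with respect to output
    return (output - target).detach()

for i in range(len(batch_list)):
    output = network(batch_list[i])            # non-scalar output
    grad = grad_fn(output, teacher[i])         # d(loss)/d(output), same shape as output
    torch.autograd.backward([output], [grad])  # seed backward with that grad
    optimizer.step()
    optimizer.zero_grad()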
It seems my network isn't training correctly.
I think the section I cropped above is where the problem is.
(Note that, for simplicity, the code isn't the original.)
# Quote from SimonW:
All autograd does is calculate the gradient; it has no notion of batching, and I don't see how it could behave differently with a different batching mechanism.
You can divide the grad by the length of the batch. This is needed especially when the batches have different lengths.
If the batches are all the same length, dividing by the batch size can be absorbed into the learning rate anyway, in which case it's not needed.
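To make the quoted suggestion concrete, here is my reading of it applied to the snippet above. It assumes that grad_fn returns a gradient summed over the samples in the batch, and that len(batch_list[i]) gives the number of samples in that batch; this is a sketch, not SimonW's code:

for i in range(len(batch_list)):
    output = network(batch_list[i])
    # Normalise the hand-computed gradient by this batch's length so that
    # batches of different sizes contribute comparably. With equal-sized
    # batches this is just a constant factor the learning rate could absorb.
    grad = grad_fn(output, teacher[i]) / len(batch_list[i])
    torch.autograd.backward([output], [grad])
    optimizer.step()
    optimizer.zero_grad()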