Hi all,
I have a question about mini-batching. Say I have sentences of variable lengths within one mini-batch, and I want to update the parameters only after feeding the whole mini-batch. The code might look like this:
model.zero_grad()
for sample in batch:
    loss = model(sample)
    loss.backward()
optim.step()
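(One detail I'm unsure about: if the goal is to match the gradient of a batch-averaged loss, I assume each per-sample loss would need to be scaled, e.g. loss = model(sample) / len(batch).)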
I checked in the docs that the gradients of leaf nodes are accumulated across backward() calls. I wonder whether the gradients of non-leaf variables are also accumulated in this case.
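To make the question concrete, here is a minimal sketch (toy tensors rather than my actual model) of what I mean by leaf vs. non-leaf accumulation:

import torch

w = torch.ones(3, requires_grad=True)  # leaf parameter
for _ in range(2):                     # two "samples", no zero_grad in between
    h = w * 2.0                        # non-leaf intermediate
    h.retain_grad()                    # non-leaf .grad is freed unless retained
    loss = h.sum()
    loss.backward()
    print(w.grad, h.grad)
# w.grad accumulates (2.0 then 4.0 per element); h is rebuilt each
# iteration, so its .grad starts fresh every time.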
I know there are alternatives like padding or sorting the samples by length, but I'm still curious whether it's appropriate to do it this way.
Thanks