The feature would be native mini-batching support for torch.autograd.backward. The mini-batches probably could not be processed in parallel because of the memory-saving goal.
The point is that you apply this gradient accumulation only in the middle of the model. If you did traditional gradient accumulation, you'd be underloading the GPU for the earlier part of the model (the backbone).
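To make that concrete, here is a rough sketch with made-up backbone/neck modules and a sum loss (all names, shapes and the chunk size are just for illustration): the backbone sees the full batch once, and only the neck is run chunk by chunk.

```python
import torch
from torch import nn

# Hypothetical stand-in modules: "backbone" runs once on the full batch,
# "neck" is the memory-heavy part that is run on smaller chunks.
backbone = nn.Linear(16, 32)
neck = nn.Linear(32, 4)
loss_fn = nn.MSELoss(reduction="sum")   # sum, so per-chunk losses add up exactly

x = torch.randn(64, 16)
target = torch.randn(64, 4)

# 1) Forward the whole batch through the backbone (GPU fully loaded here).
features = backbone(x)

# 2) Cut the graph so a gradient w.r.t. the features can be accumulated.
feats = features.detach().requires_grad_(True)

# 3) Gradient accumulation only for the neck: process the features in chunks.
for f_mini, t_mini in zip(feats.split(16), target.split(16)):
    loss = loss_fn(neck(f_mini), t_mini)
    loss.backward()   # accumulates into the neck's .grad attributes and into feats.grad

# 4) One backward through the backbone with the accumulated feature gradient.
features.backward(feats.grad)
```

The detach is what lets the chunked backward calls accumulate into feats.grad before a single backward pass through the backbone.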
Not sure if I follow. I need gradients both for x_mini and for neck.weight, which is why I can't use just torch.autograd.grad. I tried this and got an error:
# x_mini.retain_grad()
y_mini = ctx.module(x_mini)
torch.autograd.backward(y_mini, g_mini, inputs=x_mini)
RuntimeError: One of the differentiated Tensors given as 'inputs' to backward is not a leaf Tensor
That should not happen on the latest version. We added support for passing non-leaf Tensors as inputs to backward.
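For reference, a minimal sketch of that call on a recent PyTorch release, with a stand-in neck module (the module, shapes and slicing are invented for the repro):

```python
import torch
from torch import nn

neck = nn.Linear(8, 8)   # stand-in module, not the original neck

x = torch.randn(4, 8, requires_grad=True)
x_mini = x[:2]                         # a non-leaf view of x
y_mini = neck(x_mini)
g_mini = torch.ones_like(y_mini)

x_mini.retain_grad()                   # keep .grad on the non-leaf tensor
# Older releases raise the "not a leaf Tensor" RuntimeError above;
# recent releases accept a non-leaf tensor in `inputs`.
torch.autograd.backward(y_mini, g_mini, inputs=[x_mini])
print(x_mini.grad.shape)               # torch.Size([2, 8])
```

Note that `inputs` restricts which tensors get their .grad accumulated, so with inputs=[x_mini] the neck parameters receive no gradient from this call; to populate both you would list the parameters in inputs as well, or drop the inputs argument.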
The feature would be native mini-batching support for torch.autograd.backward. The mini-batches probably could not be processed in parallel because of the memory-saving goal.
The thing is that this is not really an autograd thing: it is only true under the specific assumption that you have a batch of independent samples. That does not hold in general when the loss is not a simple cumulative (per-sample) loss, or as soon as you have a batchnorm.
The workaround of using a for-loop looks simple enough here.
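As a small illustration of the batchnorm caveat (toy module, loss and sizes invented for the example): with a BatchNorm layer in the chunked part, the for-loop over mini-batches no longer reproduces the full-batch gradients, because the normalization statistics couple the samples.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Toy module with a BatchNorm layer: batch statistics couple the samples.
neck = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
x = torch.randn(32, 8)

def weight_grad(chunk_size):
    neck.zero_grad()
    for x_mini in x.split(chunk_size):
        neck(x_mini).square().sum().backward()
    return neck[0].weight.grad.clone()

full = weight_grad(32)    # one backward over the whole batch
split = weight_grad(8)    # for-loop over 4 mini-batches of 8
print(torch.allclose(full, split))   # typically False: batchnorm breaks sample independence
```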