Hi,
I think the big difference with TensorFlow is the following.
Since TensorFlow uses a static graph, you define up front exactly what should be done to perform one gradient computation/update, and then you just tell it to run that computation on a given input/target.
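For example, a training step in 1.x-style graph-mode TensorFlow looks roughly like this (the model and shapes are made up, it is just to show that the whole computation, gradients included, is fixed up front):

import numpy as np
import tensorflow as tf  # 1.x-style graph-mode API

x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.Variable(tf.random_normal([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
# The full gradient computation/update is defined once, in the graph
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ...and then you just tell it to run that on a given input
    sess.run(train_op, feed_dict={x: np.random.randn(4, 10).astype(np.float32)})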
PyTorch is significantly more flexible here: the autograd engine just “remembers” how to compute the gradient for a given variable while you are performing computations with that Variable. This means that you can get the gradients wrt a variable, then perform more computations with it, and then compute the gradients corresponding to these new operations as well.
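As a tiny (made-up) illustration of that:

import torch

x = torch.ones(3, requires_grad=True)
y = (2 * x).sum()
y.backward()   # x.grad is now d(y)/dx = [2., 2., 2.]
# You can keep computing with x after that backward pass...
z = (x * x).sum()
z.backward()   # ...and ask for gradients again; the new d(z)/dx = 2*x
               # is added on top, so x.grad is now [4., 4., 4.]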
In this scheme, there is not a single point where you stop performing “forward” operations and know that the only thing left to do is compute the gradients. So it is trickier to automatically set the gradients to 0, because you don’t know when one computation ends and a new one starts.
One case where this gradient accumulation is useful is when you share part of a network between two different tasks:
import torch
from torch import nn

# Hypothetical setup so the example is runnable: a shared feature
# extractor and two task heads (names and shapes are made up)
feature_extractor = nn.Linear(10, 5)
task1 = nn.Linear(5, 1)
task2 = nn.Linear(5, 1)
data = torch.randn(4, 10)

input = data  # with recent pytorch versions, plain tensors work; Variable is deprecated
# Get the features
features = feature_extractor(input)
# Compute the first loss and get the gradients for it
# (mean() just reduces the head output to a scalar loss)
loss1 = task1(features).mean()
loss1.backward(retain_graph=True)
# This adds the gradients wrt loss1 in both the "task1" net and the "feature_extractor" net
# So each parameter "w" in "feature_extractor" now holds d(loss1)/dw
# Perform the second task and get the gradients for it as well
loss2 = task2(features).mean()
loss2.backward()
# This will add gradients in "task2" and accumulate in "feature_extractor"
# Now each parameter in "feature_extractor" contains d(loss1)/dw + d(loss2)/dw
So the fact that gradients are accumulated allows you to get the correct gradients for all the computations that you do with a given Variable, even if you use it in multiple places in convoluted ways.
The drawback is that you have to manually reset the gradients to 0 so that the ones computed previously do not interfere with the ones you are currently computing.
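In practice that reset is usually a single call at the start of each training iteration, e.g. optimizer.zero_grad() (made-up model and optimizer here, just to show the pattern):

import torch
from torch import nn

model = nn.Linear(10, 1)  # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    batch = torch.randn(4, 10)         # dummy input
    optimizer.zero_grad()              # reset the accumulated gradients to 0
    loss = model(batch).pow(2).mean()
    loss.backward()                    # gradients for this iteration only
    optimizer.step()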