I want to understand why in Torch (and subsequently PyTorch) we made the choice to zero gradients explicitly. Why can't gradients be zeroed when loss.backward() is called? What scenario is served by keeping the gradients on the graph and asking the user to explicitly zero them?
Does anybody have a good answer for this one?
1. Gradient accumulation means we can do large-batch descent without the whole batch having to fit in memory: run several smaller forward/backward passes and let the gradients sum in .grad before taking an optimizer step.
2. If one layer (or weight) is used multiple times in the forward pass, its gradient has to accumulate across those uses during the backward pass.
3. Since the semantics are to accumulate, gradients have to be zeroed at some point, and we can't just zero them inside loss.backward(), because of (1).
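A minimal sketch of point (1), gradient accumulation, using a toy linear model and random data (all names here are illustrative, not from the original post). Because backward() adds into .grad rather than overwriting it, we can split one logical batch into several micro-batches and only call zero_grad() once per optimizer step:

```python
import torch

# Toy model and optimizer; the data and sizes are made up for illustration.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4  # pretend the "real" batch is 4x what fits in memory

opt.zero_grad()  # zero once, before accumulating
for step in range(8):
    x = torch.randn(2, 4)  # micro-batch of 2 samples
    y = torch.randn(2, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # backward() ADDS into .grad, so scaling by accum_steps
    # makes the sum equal the mean-loss gradient of the big batch
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        opt.step()       # update using the accumulated gradient
        opt.zero_grad()  # reset only now, for the next logical batch
```

If backward() zeroed gradients itself, only the last micro-batch would contribute to the update, and this pattern would be impossible. The same accumulation semantics are what make point (2) work: a weight applied twice in one forward pass receives the sum of both contributions in a single backward() call.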