Why do we have to zero gradients explicitly?

Mika_S · June 15, 2017, 8:11am

I want to understand why Torch (and subsequently PyTorch) chose to make users zero gradients explicitly. Why can’t gradients be zeroed when loss.backward() is called? What scenario is served by keeping the gradients on the graph and asking the user to explicitly zero them?

Mika_S · June 22, 2017, 9:29pm

Ping.

Does anybody have a good answer for this one?

smth · June 22, 2017, 9:50pm

Gradient accumulation means we can do batch gradient descent without the whole batch having to fit in memory at once.
If one layer (or weight) is used multiple times, its gradient has to accumulate in the backward phase.
If our setting is to accumulate gradients, we have to zero them at some point. And we can't just zero them on loss.backward(), because of the first point: accumulating over several backward calls only works if the gradients from the earlier calls are still there.
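
For anyone who wants to see the points above in action, here is a minimal sketch, assuming a reasonably recent PyTorch. The tiny model, the random data and the chunk count are all made up for illustration; this is not meant as the canonical training loop.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                 # hypothetical toy model
x, y = torch.randn(8, 4), torch.randn(8, 1)   # hypothetical "full batch" of 8
loss_fn = torch.nn.MSELoss()                  # default reduction is the mean

# Reference: gradient of the full batch in a single backward pass.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same gradient built from 4 micro-batches of 2 samples each.
# Each backward() adds into .grad; dividing by the number of chunks
# turns the sum of per-chunk means into the full-batch mean.
model.zero_grad()
num_chunks = 4
for xb, yb in zip(x.chunk(num_chunks), y.chunk(num_chunks)):
    (loss_fn(model(xb), yb) / num_chunks).backward()
accum_grad = model.weight.grad.clone()

# Forgetting to zero: the next backward() lands on top of the old gradient.
loss_fn(model(x), y).backward()
stale_grad = model.weight.grad.clone()        # accum_grad + full_grad

# A weight used twice in one graph gets the sum of both contributions.
w = torch.tensor(2.0, requires_grad=True)
(w * w).backward()                            # d(w*w)/dw = 2*w
shared_grad = w.grad.item()
```

Up to floating-point tolerance, accum_grad should match full_grad, and stale_grad should come out at about twice full_grad. If backward() zeroed the gradients itself, the micro-batch loop would only ever keep the last chunk's gradient.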