Does zero_grad() zero intermediate steps as well?

Well, I tried to get the answer out of ChatGPT, but it was too easily influenced by my assumptions.

Let’s say I have a loss that is calculated after some intermediate steps, and the optimizer is set up with the model’s parameters, so we have:

input => model’s parameters => intermediate steps => loss

and when we call loss.backward():

input (accumulated gradient) => model’s parameters (accumulated gradient) => intermediate steps (accumulated gradient) => loss

Then we call optimizer.step() and optimizer.zero_grad(). Does this zero_grad() zero all accumulated gradients, or only the model’s parameters’ gradients, since the optimizer was only initialized with the model’s parameters? That is, compare these:

input => model’s parameters => intermediate steps => loss
input (still has accumulated gradient, albeit not relevant for future calculations) => model’s parameters (zeroed) => intermediate steps (still has accumulated gradient, interferes with future backprops?) => loss
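To make the setup concrete, here is a minimal sketch of what I mean (the model, shapes, and loss are just placeholders for illustration):

```python
import torch

model = torch.nn.Linear(4, 2)                        # model's parameters (leaf tensors)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)                                # input
intermediate = model(x)                              # intermediate step (non-leaf tensor)
loss = intermediate.pow(2).mean()                    # loss calculated after the intermediate step

loss.backward()                                      # gradients land in model.weight.grad / model.bias.grad
optimizer.step()
optimizer.zero_grad()                                # clears the .grad attributes of the parameters passed to the optimizer
```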

Oh wait, I think I’m being stupid. requires_grad == False by default, but you can still backward() a grad through a tensor that has requires_grad == False, so that’s why it’s fine. Yeah, I’m probably overthinking this; could someone confirm whether this is true, just to make sure?

Intermediate steps don’t have accumulated gradients, as they are only stored temporarily and recreated in each forward pass.
The gradients will be accumulated in the .grad attribute of the used leaf variables, i.e. the parameters, and the optimizer will zero these attributes for all passed parameters.
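You can check this with a small snippet (the model and shapes here are only for illustration): after backward(), only the leaf parameters have a populated .grad, while the intermediate (non-leaf) activation does not:

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
intermediate = model(x)            # non-leaf tensor, recreated in every forward pass
loss = intermediate.pow(2).mean()
loss.backward()

print(model.weight.is_leaf)        # True  -> parameter is a leaf tensor
print(intermediate.is_leaf)        # False -> intermediate activation is a non-leaf tensor
print(model.weight.grad is None)   # False -> leaf parameter has an accumulated gradient
print(intermediate.grad)           # None  -> non-leaf tensor has no populated .grad (PyTorch warns when you access it)

optimizer.step()
optimizer.zero_grad()              # clears only the .grad of the parameters passed to the optimizer
```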


If I understand correctly, non-leaf nodes in the computational graph do not store any gradients, so they can be backpropped through multiple times without any “accumulated gradient reset” process being required. Zeroing gradients is only required for leaf nodes, which store the gradients as a numerical value?

Yes, intermediate forward activations are used to compute the gradients and do not receive gradients themselves. After a backward pass they will be deleted to save memory and you won’t be able to reuse them multiple times unless you specify .backward(retain_graph=True).
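A minimal sketch illustrating this (the model here is just a placeholder): backpropagating twice through the same graph only works if the first backward pass retained it.

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

loss = model(x).pow(2).mean()
loss.backward(retain_graph=True)   # keeps the graph and its saved activations alive
loss.backward()                    # second backward works because the graph was retained

loss = model(x).pow(2).mean()      # a fresh forward pass recreates the intermediates
loss.backward()                    # graph is freed here (default retain_graph=False)
try:
    loss.backward()                # backward through the already-freed graph
except RuntimeError as e:
    print("Expected error:", e)    # "Trying to backward through the graph a second time ..."
```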
