Well, I tried to get the answer out of ChatGPT, but it was too easily influenced by my assumptions.
Let’s say I have a loss that is calculated after some intermediate steps, and the optimizer is set up using the model’s parameters, so we have:
input => model’s parameters => intermediate steps => loss
and when we call loss.backward():
input (accumulated gradient) => model’s parameters (accumulated gradient) => intermediate steps (accumulated gradient) => loss
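Concretely, something like this minimal sketch (the toy model, the intermediate step, and the tensor names are just made up for illustration):

```python
import torch
import torch.nn as nn

# toy model; the optimizer only ever sees the model's parameters
model = nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# input => model's parameters => intermediate steps => loss
x = torch.randn(2, 4)               # input, requires_grad is False by default
hidden = model(x)                   # uses the model's parameters
intermediate = hidden.tanh() * 2    # some intermediate steps
loss = intermediate.pow(2).mean()   # loss

loss.backward()                     # gradients flow back through the whole chain
```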
Then we call optimizer.step() and optimizer.zero_grad(). Does this zero_grad() zero all accumulated gradients, or only the model’s parameters’ gradients, since the optimizer was initialized only with the model’s parameters? That is, compare these two outcomes:
input (zeroed) => model’s parameters (zeroed) => intermediate steps (zeroed) => loss
input (still has accumulated gradient, albeit not relevant for future calculations) => model’s parameters (zeroed) => intermediate steps (still has accumulated gradient, interferes with future backprops?) => loss
Oh wait, I think I’m being dumb. requires_grad is False by default for the input, so no gradient is ever computed or accumulated on it, and backward() still works because the parameters themselves require grad. The intermediate steps are non-leaf tensors, so even though gradients flow through them, their .grad is not retained (unless you call .retain_grad()), so nothing accumulates there either, and that’s why it’s fine. Could someone just confirm this if it’s true, to make sure?
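i.e. this is roughly what I would expect to see if the above is right (again just a sketch with made-up names, reusing the same toy setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 4)              # input, requires_grad is False by default
intermediate = model(x).tanh()     # non-leaf: requires_grad is True, but .grad is not retained
loss = intermediate.pow(2).mean()
loss.backward()

print(x.grad)                # None: no gradient is computed for a leaf that doesn't require grad
print(intermediate.is_leaf)  # False
print(intermediate.grad)     # None (PyTorch warns that non-leaf .grad is not retained by default;
                             # you'd need intermediate.retain_grad() before backward() to keep it)
print(model.weight.grad)     # an actual gradient tensor

optimizer.step()
optimizer.zero_grad()        # only clears the grads of the parameters passed to the optimizer
print(model.weight.grad)     # None or zeros, depending on the set_to_none default in your version
```

So as far as I can tell, zero_grad() only knows about whatever parameters you handed the optimizer, and nothing else ever had a .grad to clear in the first place.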