The goal is to accumulate gradients over several timesteps and then update the model on every Nth timestep, but I'm not sure how to do it.
My idea is this: call loss.backward() on every timestep, and then on every Nth iteration call optimizer.step() followed by optimizer.zero_grad().
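A minimal sketch of what I mean (the model, optimizer, and data_loader here are just placeholders I made up for illustration):

```python
import torch

# Placeholder model, loss, and data just to make the example runnable
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
data_loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(20)]

N = 4  # update the model every N timesteps

for step, (inputs, targets) in enumerate(data_loader):
    loss = criterion(model(inputs), targets)
    loss.backward()  # compute gradients on every timestep

    if (step + 1) % N == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next N steps
```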
Would this work, or would the gradients computed by loss.backward() be overwritten at every timestep?