If you use the same weights, you will have to use the for loop, I’m afraid.
Note that in your code, you want to do weights.grad.clone() on the last line. All changes to .grad happen in place, so after you zero the gradients, your saved buffer will contain only zeros if you don’t clone.
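A minimal sketch of the difference (the tensor name `w` is just a placeholder for your weights):

```python
import torch

w = torch.randn(3, requires_grad=True)
w.sum().backward()          # d(sum)/dw is all ones

alias = w.grad              # shares storage with w.grad
snapshot = w.grad.clone()   # independent copy

w.grad.zero_()              # in-place zeroing, as zero_grad() does

print(alias)     # all zeros -- the alias was zeroed too
print(snapshot)  # tensor([1., 1., 1.]) -- the clone survived
```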
Also, I’m wondering: for a deep network’s parameters, what is the best structure to save the most recent iteration’s gradients, so that each gradient can still be looked up by something like an index (a list isn’t the best choice here, I think)? I don’t need the whole history, only the gradient information from the last iteration.
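One common structure for this is a dict keyed by parameter name, via model.named_parameters(). A sketch under that assumption (the model and `snapshot_grads` helper are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def snapshot_grads(model):
    # Clone each gradient so later zero_grad()/backward() calls
    # don't overwrite the saved copies in place.
    return {name: p.grad.clone()
            for name, p in model.named_parameters()
            if p.grad is not None}

model(torch.randn(2, 4)).sum().backward()
last_grads = snapshot_grads(model)   # overwrite this every iteration
print(last_grads["0.weight"].shape)  # torch.Size([8, 4])
```

Overwriting the dict each iteration keeps only the last step’s gradients, which matches what you describe.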
For more details: I’m trying to implement Algorithm 1 in this paper.
I’m afraid you will have to do the bookkeeping by hand, and potentially implement a new optimizer.
As an example, you can look at how RMSprop handles such bookkeeping.
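A stripped-down sketch of that pattern, not the real torch.optim.RMSprop, just the per-parameter self.state idiom it (and other torch.optim optimizers) uses to remember values between steps:

```python
import torch
from torch.optim import Optimizer

class RunningSquareSGD(Optimizer):
    """Toy optimizer (hypothetical name): keeps a running average of
    squared gradients per parameter in self.state, RMSprop-style."""

    def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8):
        super().__init__(params, dict(lr=lr, alpha=alpha, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]        # per-parameter dict
                if len(state) == 0:          # lazily initialize on first step
                    state["square_avg"] = torch.zeros_like(p)
                sq = state["square_avg"]
                # sq <- alpha * sq + (1 - alpha) * grad^2
                sq.mul_(group["alpha"]).addcmul_(p.grad, p.grad,
                                                 value=1 - group["alpha"])
                # p <- p - lr * grad / (sqrt(sq) + eps)
                p.addcdiv_(p.grad, sq.sqrt().add_(group["eps"]),
                           value=-group["lr"])
```

You could store your paper’s per-parameter quantities (e.g. the previous iteration’s gradient) in the same self.state dict instead of a running average.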