I am modeling k-dimensional positions over time steps t = 0…T, using a tensor of initial positions Z0 with requires_grad=True and storing the results for the remaining time steps in a tensor Z with requires_grad=False.
A simple model is Zt = Zt-1 + e, where e is some constant noise term. This is optimized in PyTorch using gradient descent, by moving the initial positions accordingly.
The problem is that when Z is used to compute subsequent time steps for t > 1, the relation between Zt and Z0 is lost, so the model converges significantly more slowly than when simply modeling Zt = Z0 + t * e, where the dependency between the initial positions and Zt is retained.
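A minimal sketch of the two formulations (the dimensions T and k, the scalar noise term e, and the loss are hypothetical placeholders): writing each step into a tensor with requires_grad=False detaches it from Z0, while the closed-form version keeps the dependency.

```python
import torch

T, k = 10, 2   # number of time steps and dimensions (assumed values)
e = 0.1        # constant noise term (assumed scalar here)

Z0 = torch.zeros(k, requires_grad=True)

# Recursive formulation: store intermediate steps in Z.
# Since Z has requires_grad=False, each stored step is detached
# from Z0, so gradients cannot flow back to the initial positions.
Z = torch.zeros(T, k)       # requires_grad=False
Z[0] = Z0.detach()          # storing breaks the graph
for t in range(1, T):
    Z[t] = Z[t - 1] + e     # Z[t] has no path back to Z0

# Closed-form formulation: Zt = Z0 + t * e keeps the dependency,
# so a loss on any Zt produces gradients with respect to Z0.
t = torch.arange(T, dtype=torch.float32).unsqueeze(1)
Z_closed = Z0 + t * e       # every row depends on Z0

loss = Z_closed.sum()       # placeholder loss
loss.backward()
print(Z0.grad)              # gradients flow through the closed form
```

Optimizing the recursive version therefore only "sees" Z0 through the first step, whereas the closed form propagates the error at every time step back to the initial positions.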
Note: this model is for illustrative purposes only; the actual models in question are too complex to be defined directly in terms of Z0 and require the intermediate results stored in Z.
Accumulating gradients or retaining the computation graph (retain_graph=True) does not help.