I am modeling *k*-dimensional positions over time *t = 0…T* using a set of initial positions *Z_{0}* with **requires_grad=True**, and storing the results in *Z* with **requires_grad=False** for the remaining *T-1* time steps.

A simple model is *Z_{t} = Z_{t-1} + e*, where *e* is some constant noise. This is optimized in PyTorch using gradient descent, by moving the initial positions accordingly.
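A minimal sketch of this setup (variable names and shapes are my assumptions, not from the question): the recursive steps are written into a storage tensor *Z* with **requires_grad=False**, so the autograd graph back to *Z_{0}* is never built.

```python
import torch

# Hypothetical dimensions: T time steps, k-dimensional positions.
T, k = 5, 2
e = 0.1  # constant noise term

# The initial positions are the only learnable parameters.
Z0 = torch.zeros(k, requires_grad=True)

# Z stores the trajectory with requires_grad=False; writing the
# intermediate results into it under no_grad detaches them from Z0.
Z = torch.zeros(T, k)
with torch.no_grad():
    Z[0] = Z0
    for t in range(1, T):
        Z[t] = Z[t - 1] + e  # no graph recorded here

loss = Z[T - 1].sum()
print(loss.requires_grad)  # False: gradients cannot reach Z0
```

Because the loss has no `grad_fn`, calling `loss.backward()` here would fail, which is the disconnect described above.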

The problem is, when using *Z* to compute subsequent time steps for *t > 1*, the relation between *Z_{t}* and *Z_{0}* is lost, such that the model converges significantly more slowly than simply modeling *Z_{t} = Z_{0} + t·e*, where the dependency between the initial positions and *Z_{t}* is retained.
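For comparison, the closed-form variant keeps every time step connected to *Z_{0}* (again a sketch under the same assumed names and shapes):

```python
import torch

T, k = 5, 2
e = 0.1
Z0 = torch.zeros(k, requires_grad=True)

# Direct formulation: each Z_t is an expression of Z0, so the
# autograd graph links every time step back to the parameters.
t = torch.arange(T, dtype=torch.float32).unsqueeze(1)  # shape (T, 1)
Z = Z0 + t * e  # broadcasts to shape (T, k); all rows depend on Z0

loss = Z[T - 1].sum()
loss.backward()
print(Z0.grad)  # tensor([1., 1.]): gradient flows back to Z0
```

Here gradient descent on *Z_{0}* works directly, which is why this formulation converges faster.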

Note: This model is for illustrative purposes only; the actual models in question are too complex to be defined directly in terms of *Z_{0}*, and require the intermediary results stored in *Z*.

Accumulating gradients or retaining the gradient graph does not help.