Hi,
I will try to explain my question through this toy problem:
Let's say I have a simple neural network consisting of a fully connected layer followed by an LSTM. The network continuously receives inputs [X_1, X_2, X_3, …], and two iterations after each input it is expected to output a value that depends on the input received two steps earlier. The outputs are therefore [_, _, f(X_1), f(X_2), f(X_3), …], where _ marks the first two outputs, which are ignored, and f(X_n) is the output whose loss is computed against the target paired with X_n.
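For concreteness, here is a minimal PyTorch sketch of the network I have in mind (the layer sizes and the linear output head are placeholders I picked for illustration, not part of the actual setup):

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self, in_dim=8, hidden_dim=16):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)       # fully connected layer
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)          # placeholder output head

    def forward(self, x, state):
        # one time step: FC -> LSTM cell -> output, returning the new state
        h, c = self.lstm(self.fc(x), state)
        return self.head(h), (h, c)
```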
The key difference from standard training is that I would like to backpropagate through the network after each output f(X_n), retaining only the gradient information that is still relevant to the later outputs f(X_m), m > n. In my example, after backpropagating the loss for f(X_1), I would like to discard the gradient obtained in the first step. In other words, I would like to treat the hidden state fed into the LSTM alongside X_2 as a plain input with no connections to anything earlier, and to remove the gradients that were computed for the fully connected layer after X_1 was passed through it but before X_2 was.
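Ignoring the cost of recomputation for a moment, the gradients I want after each output are, I believe, the ones this sliding-window version would produce (a sketch assuming the ToyNet above; the input stream, dummy targets, and window bookkeeping are made up for illustration):

```python
import torch

torch.manual_seed(0)

model = ToyNet()                        # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

DELAY = 2                               # output at step n is paired with X_{n-2}
inputs, targets = [], []
# states[k] is the (graph-free) state entering step k
states = [(torch.zeros(1, 16), torch.zeros(1, 16))]

for n in range(100):                    # stand-in for the real input stream
    x = torch.randn(1, 8)
    inputs.append(x)
    targets.append(x.sum().view(1, 1))  # dummy target paired with X_n

    # advance the running state without keeping a graph around
    with torch.no_grad():
        _, s = model(x, states[-1])
    states.append(s)

    if n >= DELAY:
        start = n - DELAY
        # re-run steps start..n from a detached state, so the loss for
        # f(X_start) cannot send gradient into anything before X_start
        state = tuple(t.detach() for t in states[start])
        for m in range(start, n + 1):
            out, state = model(inputs[m], state)
        loss = loss_fn(out, targets[start])
        opt.zero_grad()                  # drop gradients from earlier windows
        loss.backward()
        opt.step()
```

What I am unsure about is how to get these same gradients from the single, ongoing forward pass, by detaching the old hidden state and deleting the stale gradients as described above, instead of re-running the window.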
I believe this could be partially solved by using multiple optimizers, but I suspect that would not work with stateful optimizers like Adam.
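As a self-contained toy illustration of why I suspect Adam is a problem here (not from my actual code): with plain SGD, two separate steps on gradients g1 and g2 match a single step on g1 + g2, but with Adam they do not, because the running moment estimates are updated at every step:

```python
import torch

def split_vs_joint(opt_cls):
    # apply g1 and g2 in two separate steps vs. one step on their sum
    p_split = torch.nn.Parameter(torch.ones(1))
    p_joint = torch.nn.Parameter(torch.ones(1))
    opt_split = opt_cls([p_split], lr=0.1)
    opt_joint = opt_cls([p_joint], lr=0.1)
    g1, g2 = torch.tensor([0.3]), torch.tensor([0.7])

    for g in (g1, g2):
        p_split.grad = g.clone()
        opt_split.step()
        opt_split.zero_grad()

    p_joint.grad = g1 + g2
    opt_joint.step()
    return p_split.item(), p_joint.item()

print(split_vs_joint(torch.optim.SGD))   # same update (up to float rounding)
print(split_vs_joint(torch.optim.Adam))  # updates differ: Adam's state is path-dependent
```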