I am implementing an attention-like mechanism autoregressively, which means I need to accumulate a tensor containing all previous GRU hidden states and use it in a matrix multiplication at each timestep. To accumulate this tensor I could use torch.cat(), but since memory is likely to be the bottleneck during training, allocating a new, larger tensor at every step seems a bit sketchy.
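For reference, here is a minimal sketch of what the torch.cat() variant could look like. The sizes, the GRUCell, and the dot-product scoring are placeholders I made up for illustration, not my actual model:

```python
import torch

torch.manual_seed(0)
hidden_size, steps = 8, 5  # hypothetical sizes, for illustration only
cell = torch.nn.GRUCell(hidden_size, hidden_size)

h = torch.zeros(1, hidden_size)
memory = torch.empty(0, hidden_size)  # grows by one row per timestep
outputs = []
for t in range(steps):
    x = torch.randn(1, hidden_size)  # stand-in for the real input
    h = cell(x, h)
    # torch.cat allocates a fresh (t + 1, hidden_size) tensor on every step
    memory = torch.cat([memory, h], dim=0)
    # attention-like read over all hidden states seen so far
    scores = torch.softmax(h @ memory.T / hidden_size ** 0.5, dim=-1)
    outputs.append(scores @ memory)  # shape (1, hidden_size)

loss = torch.stack(outputs).sum()
loss.backward()  # works: no tensor saved for backward is modified in place
```

This runs and backpropagates fine; my concern is only the repeated allocation as the sequence gets longer.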

My current attempt, which does not work, was to pre-allocate a tensor of zeros, assign the hidden state to it at each timestep, and use it in the matrix multiplication at the same time. As expected, this raises an error saying that one of the variables needed for gradient computation has been modified by an inplace operation.
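A stripped-down sketch of the pre-allocation pattern that triggers the error for me. Again, the sizes, the GRUCell, and the plain matrix product are simplified placeholders rather than my real code:

```python
import torch

torch.manual_seed(0)
hidden_size, steps = 8, 5  # hypothetical sizes, for illustration only
cell = torch.nn.GRUCell(hidden_size, hidden_size)

h = torch.zeros(1, hidden_size)
memory = torch.zeros(steps, hidden_size)  # pre-allocated buffer
outputs = []
for t in range(steps):
    x = torch.randn(1, hidden_size)  # stand-in for the real input
    h = cell(x, h)
    memory[t] = h[0]  # in-place write into the shared buffer
    # autograd saves the buffer here for the backward pass,
    # and the next iteration overwrites part of it in place
    outputs.append(h @ memory.T)

loss = torch.stack(outputs).sum()
try:
    loss.backward()
    failed = False
except RuntimeError as err:
    failed = True
    message = str(err)  # mentions a variable modified by an inplace operation
```

As far as I can tell, the matrix multiplication keeps a reference to the buffer for its backward pass, so every later write to the buffer invalidates it.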

My question is: how would one accumulate such a tensor in a memory-efficient way? The paper that introduced the technique I am using (SHA-RNN) praises it as computationally efficient because the tensor can be accumulated without repeating computations, but I do not see how this is possible at the moment.