Memory when storing states in a list

From this thread I learned that appending a tensor to a list also keeps that tensor's whole computational graph alive.

To fix this I could detach the tensors before storing them, but in my case I need them for gradient computation. For each pixel of an image I have to store one state tensor, and only once all state tensors are computed can I compute the output. With more layers I need to store n times as many tensors.
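A minimal sketch of the trade-off, using a hypothetical recurrent-style loop (not the poster's actual model): appending states without `detach()` retains the graph so backprop works, while appending `state.detach()` frees the graph but makes gradients through those states impossible.

```python
import torch

torch.manual_seed(0)

# Toy recurrent step: each state depends on the previous one, so every
# stored state drags the whole preceding graph along with it.
w = torch.randn(4, 4, requires_grad=True)
state = torch.zeros(4)

states = []
for _ in range(3):
    state = torch.tanh(w @ state + 1.0)
    states.append(state)            # graph retained: needed for backprop
    # states.append(state.detach()) # graph freed, but no gradients later

# Backprop through all stored states at once.
loss = torch.stack(states).sum()
loss.backward()
assert w.grad is not None
```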

Currently I use 64x84 images, and with two layers of 256 units I already run out of memory on my university's cluster.

Is there some way to reduce the memory needed, or is my architecture simply unable to handle larger inputs?

If you want to backprop through all of these states, then you do need to keep them.

An alternative is to trade memory for compute using the torch.utils.checkpoint module.
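A sketch of how checkpointing applies here, using a hypothetical `step` function standing in for one layer update: activations inside the checkpointed call are not stored during the forward pass and are recomputed during backward, cutting memory at the cost of extra compute. (`use_reentrant=False` is the mode recommended in recent PyTorch versions.)

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical per-step function; its intermediate activations are not
# kept in memory but recomputed when backward() runs.
def step(w, state):
    return torch.tanh(w @ state)

w = torch.randn(4, 4, requires_grad=True)
state = torch.ones(4)

for _ in range(3):
    state = checkpoint(step, w, state, use_reentrant=False)

state.sum().backward()
assert w.grad is not None
```

Gradients come out the same as without checkpointing; only the memory/compute balance of the forward pass changes.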