From this thread I learned that whenever I append a tensor to a list, I also keep its entire computational graph alive.
To fix this I could detach the tensors before storing them; however, in my case I need them for gradient computation. For each pixel of the image I have to store one state tensor, and only once all state tensors are computed can I compute the output. With n layers I need to store n times as many tensors.
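A minimal sketch of the pattern I mean (the names and shapes here are just illustrative, not my actual model):

```python
import torch

# One state tensor is stored per "pixel"; each stored tensor keeps a
# reference to the graph that produced it, so memory grows with every
# append until backward() releases the graphs.
layer = torch.nn.Linear(4, 4)

states = []
for _ in range(3):  # stands in for looping over all pixels
    x = torch.randn(1, 4)
    h = layer(x)
    states.append(h)  # keeps the whole graph alive
    # states.append(h.detach())  # would free the graph, but breaks gradients

# Every stored tensor still carries its graph:
print(all(s.grad_fn is not None for s in states))
```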
Currently I use 64x84 images, and with two layers of 256 units each I already run out of memory on my university's cluster.
Is there some way to reduce the memory needed, or is my architecture simply unable to scale to larger sizes?