GPU memory usage on long seq2seq sequences


I’m trying to optimise the memory requirements of a seq2seq decoder when every decoder input is taken from the previous step’s output (no teacher forcing).

In that case I can’t use pack_padded_sequence and run the RNN on the full batch; instead I iterate over sequence offsets, accumulating the loss from every step. From my experiments, GPU memory consumption in this mode grows almost linearly with input sequence length, which, even for small LSTM nets, limits sequences to 300-400 steps.

I’ve implemented a small demonstration tool which generates a random batch of sequences and runs the loop over their offsets.
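The original script is not reproduced in this thread. As a rough illustration only, a loop of the kind described might look like the sketch below. It assumes a recent PyTorch (not 0.1.12), uses nn.LSTMCell, and all sizes and names are made up for the example rather than taken from the original tool.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration.
INPUT_SIZE, HIDDEN_SIZE, BATCH, SEQ_LEN = 10, 32, 4, 20

torch.manual_seed(0)
cell = nn.LSTMCell(INPUT_SIZE, HIDDEN_SIZE)
net_out = nn.Linear(in_features=HIDDEN_SIZE, out_features=INPUT_SIZE)
loss_fn = nn.CrossEntropyLoss()

targets = torch.randint(0, INPUT_SIZE, (BATCH, SEQ_LEN))  # random token ids
h = torch.zeros(BATCH, HIDDEN_SIZE)
c = torch.zeros(BATCH, HIDDEN_SIZE)
inp = torch.zeros(BATCH, INPUT_SIZE)  # stand-in for a start-of-sequence input

total_loss = torch.zeros(())
for t in range(SEQ_LEN):
    h, c = cell(inp, (h, c))
    logits = net_out(h)
    total_loss = total_loss + loss_fn(logits, targets[:, t])
    # Feed the model's own prediction back in as the next input.
    inp = F.one_hot(logits.argmax(dim=1), INPUT_SIZE).float()

# The graph for all SEQ_LEN steps is kept alive until this single call.
total_loss.backward()
```

Because the summed loss depends on every step, autograd has to keep each step's intermediate tensors until backward() runs, which is consistent with memory growing with sequence length.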

Results from running it on a GTX 1080 Ti with PyTorch 0.1.12:

  • seq_len=50 -> 1.5GB
  • seq_len=200 -> 5.0GB
  • seq_len=300 -> 7.2GB
  • seq_len=400 -> 9.5GB
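A quick back-of-the-envelope check of how linear these numbers are: the sketch below draws a straight line through the first and last measurements and compares it with the two points in between. It uses only the four figures quoted above; the variable names are just for illustration.

```python
# Measurements quoted above: sequence length -> GPU memory in GB.
measurements = {50: 1.5, 200: 5.0, 300: 7.2, 400: 9.5}

lengths = sorted(measurements)
first, last = lengths[0], lengths[-1]

# Line through the two end points.
slope_gb = (measurements[last] - measurements[first]) / (last - first)
intercept_gb = measurements[first] - slope_gb * first

print(f"about {slope_gb * 1024:.0f} MB per extra step, "
      f"about {intercept_gb:.2f} GB fixed overhead")
for n in lengths:
    predicted = intercept_gb + slope_gb * n
    print(f"seq_len={n}: measured {measurements[n]:.1f} GB, "
          f"line gives {predicted:.2f} GB")
```

The slope works out to roughly 23 MB per additional step, and the intermediate points sit within about 0.1 GB of the line.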

I guess such high memory consumption is due to gradients accumulated while summing the loss.

So, my question is: is it possible to optimise memory consumption in such a case?

My understanding is that the gradients for every LSTM matrix should be aggregated somehow across all sequence steps, but they are retained in separate buffers until the final .backward() call. Is it possible to achieve this?

The other option would be to call .backward() at every sequence step, but that doesn’t look viable here, as the decoding step is preceded by an encoder run, and I’m not sure the encoder’s gradients would be valid.



Recommend upgrading to pytorch 0.2. This might fix your memory issue?

I don’t think so. My issue doesn’t look like a bug; it’s more a consequence of the way PyTorch calculates gradients (tape-based gradient calculation) and how I’m using my RNN.

Something I really don’t understand is whether RNNs are unrolled or not, and how to choose. In Keras with the Theano backend you can specify unroll = True or False depending on the memory vs. speed tradeoff you want to make.

By the way, in the specific case described here, whether or not the LSTM is unrolled, memory consumption will grow because of the output layer you use:
net_out = nn.Linear(in_features=HIDDEN_SIZE, out_features=INPUT_SIZE)

Indeed, printing the layer gives:
Linear (512 -> INPUT_SIZE)

As far as I understand the whole machinery, PyTorch has a dynamic graph, which lets you decide how many steps to unroll your RNN. With simple architectures, it lets you skip time batching altogether (which you have to do in Keras) and process a whole batch of variable-length sequences with one RNN call preceded by pack_padded_sequence.

Unfortunately, it doesn’t work so smoothly for seq2seq architectures.

I had the same problem with an encoder-decoder model used for summarization: encoder sequences of ~400 and decoder sequences of ~120 max length on my 12 GB Titan X. Also, this was on v0.2.
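For reference, the single-call pattern mentioned here can be sketched as follows. This assumes a recent PyTorch; the sizes and the toy batch are made up for the example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

INPUT_SIZE, HIDDEN_SIZE = 10, 16  # hypothetical sizes
torch.manual_seed(0)

# Three sequences of different lengths, sorted longest first, zero-padded.
lengths = torch.tensor([5, 3, 2])
batch = torch.zeros(len(lengths), int(lengths.max()), INPUT_SIZE)
for i, n in enumerate(lengths.tolist()):
    batch[i, :n] = torch.randn(n, INPUT_SIZE)

lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, batch_first=True)

# One RNN call over the whole variable-length batch.
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n holds each sequence's state at its own last valid step.
print(out.shape, out_lengths.tolist(), h_n.shape)
```

This only works when all inputs are known up front, which is exactly what is missing when each decoder input depends on the previous step's output.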

You could try using a different optimizer… one that doesn’t track gradients…

Hm, that’s an interesting suggestion. Could you give more information about optimizers? I thought all of them just use gradients gathered from the computation graph built from Variable operations.

Today I had a different idea: split both input and output sequences into chunks of fixed size and update gradients at the end of each chunk. It should be similar to “RNN unrolling” in Keras or TF and, in theory, can hurt convergence, but it looks like a solution to the memory limitation. I haven’t tried it yet.
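The chunking idea is usually called truncated backpropagation through time. A minimal sketch of it for a decoder-only loop is shown below, assuming a recent PyTorch; the sizes, the SGD optimizer and the CHUNK value are illustrative assumptions, not something from this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; CHUNK is the number of steps backpropagated through.
INPUT_SIZE, HIDDEN_SIZE, BATCH, SEQ_LEN, CHUNK = 10, 32, 4, 40, 10

torch.manual_seed(0)
cell = nn.LSTMCell(INPUT_SIZE, HIDDEN_SIZE)
net_out = nn.Linear(HIDDEN_SIZE, INPUT_SIZE)
params = list(cell.parameters()) + list(net_out.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

targets = torch.randint(0, INPUT_SIZE, (BATCH, SEQ_LEN))
h = torch.zeros(BATCH, HIDDEN_SIZE)
c = torch.zeros(BATCH, HIDDEN_SIZE)
inp = torch.zeros(BATCH, INPUT_SIZE)

updates = 0
for start in range(0, SEQ_LEN, CHUNK):
    # Cut the graph here so earlier chunks' activations can be freed.
    h, c = h.detach(), c.detach()
    chunk_loss = torch.zeros(())
    for t in range(start, min(start + CHUNK, SEQ_LEN)):
        h, c = cell(inp, (h, c))
        logits = net_out(h)
        chunk_loss = chunk_loss + loss_fn(logits, targets[:, t])
        inp = F.one_hot(logits.argmax(dim=1), INPUT_SIZE).float()
    optimizer.zero_grad()
    chunk_loss.backward()  # backpropagates through at most CHUNK steps
    optimizer.step()
    updates += 1
```

With this pattern, peak memory depends on CHUNK rather than on the full sequence length. In an encoder-decoder setup, detaching the state also cuts the path back to the encoder, so only the first chunk would send gradient into it.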

I tried your example without the enormous
net_out = nn.Linear(in_features=HIDDEN_SIZE, out_features=INPUT_SIZE)
layer, and it is not so terrible.

If you want to have a Linear layer after the LSTM, you have to repeat INPUT_SIZE times a single nn.Linear(in_features=HIDDEN_SIZE, out_features=1).

It doesn’t change our conceptual problem, but you can build a much longer seq2seq architecture this way.

That’s something I don’t understand. Could you please explain?

My understanding is that net_out is not too large (512*10) compared to the LSTM unit itself, which has 4*(512*10 + 512^2), i.e. about 200 times more parameters. Additionally, with this mode of seq2seq, on every step of the decoder RNN I need to apply the linear layer to calculate the next token, which is fed in as input on the next step.

Of course, in real-world applications, when the number of output tokens is the full vocabulary (hundreds of thousands or even millions of words), the output layer will dominate. But that is a completely different story, which can be solved with hierarchical or sampled softmax.
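The arithmetic behind this comparison can be checked directly. The sketch below assumes INPUT_SIZE = 10 and HIDDEN_SIZE = 512, as the figures in the post suggest, and ignores bias terms.

```python
INPUT_SIZE, HIDDEN_SIZE = 10, 512  # values read off the figures in the post

# Output layer: one HIDDEN_SIZE x INPUT_SIZE weight matrix.
linear_weights = HIDDEN_SIZE * INPUT_SIZE

# LSTM: four gates, each with an input-to-hidden and a hidden-to-hidden matrix.
lstm_weights = 4 * (HIDDEN_SIZE * INPUT_SIZE + HIDDEN_SIZE ** 2)

ratio = lstm_weights / linear_weights
print(linear_weights, lstm_weights, round(ratio))
```

The LSTM comes out at a little over a million weights against about five thousand for the linear layer, so the "about 200 times" estimate holds.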

My bad, I got confused between INPUT_SEQ_LEN and INPUT_SIZE :(
By the way, with no GPU I got:
input_seq_len = 500 -> 2.8GB

Hm, that’s interesting. Maybe some cudnn issue/inefficiency.

After some tests, I don’t think there is any solution. I was afraid the weights were duplicated along the sequence, leading to 512^2 * INPUT_SEQ_LEN memory usage, but weight sharing seems to be correctly implemented, so that is not the case.
Still, when backpropagating we have to store the forward pass first, so whatever we do we will have memory consumption on the order of 512 * INPUT_SEQ_LEN. The only solution is indeed to limit back-propagation through time to a fixed number of steps.