GPU memory usage on long seq2seq sequences


(Max Lapan) #1

Hi!

I’m trying to optimise memory requirements for a seq2seq decoder when every decoder input is taken from the previous step’s output (non-teacher-forcing mode).

In that case, I can’t use the pack_padded_sequence method and execute the RNN on the full batch in one call; instead, I iterate over sequence offsets, accumulating the loss from every step. From my experiments, I found that in this mode GPU memory consumption grows almost linearly with input sequence length, which, even for small LSTM nets, limits sequences to 300–400 steps.
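The loop in question looks roughly like this (a minimal sketch in modern PyTorch terms, with made-up sizes; `rnn`, `net_out` and the dummy target are stand-ins, not the code from the gist):

```python
import torch
import torch.nn as nn

HIDDEN_SIZE, INPUT_SIZE, BATCH, SEQ_LEN = 512, 10, 16, 50

rnn = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, batch_first=True)
net_out = nn.Linear(HIDDEN_SIZE, INPUT_SIZE)

x = torch.zeros(BATCH, 1, INPUT_SIZE)      # first decoder input (e.g. BOS)
target = torch.zeros(BATCH, INPUT_SIZE)    # dummy per-step target
hidden = None
loss = 0.0

for step in range(SEQ_LEN):
    out, hidden = rnn(x, hidden)           # one time step per call
    logits = net_out(out.squeeze(1))
    loss = loss + nn.functional.mse_loss(logits, target)
    # feed the step's output back in as the next input
    x = logits.unsqueeze(1).detach()

# every step's forward activations stay alive until this single call,
# which is why memory grows linearly with SEQ_LEN
loss.backward()
```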

I’ve implemented a small demonstration tool which generates a random batch of sequences and iterates the loop over their offsets: https://gist.github.com/Shmuma/614d3dbe0ad2805d048ff0e6129682aa

Results from its run on a GTX 1080 Ti with PyTorch 0.1.12:

  • seq_len=50 -> 1.5GB
  • seq_len=200 -> 5.0GB
  • seq_len=300 -> 7.2GB
  • seq_len=400 -> 9.5GB

I guess such high memory consumption is due to the gradients accumulated during loss summing.

So, my question is: is it possible to optimise memory consumption in such case?

My understanding is that the gradients for every LSTM matrix should be aggregated somehow across all sequence steps, but they are retained in separate buffers until the final .backward() call. Is it possible to achieve this aggregation earlier?

The other option would be to call .backward() for every sequence step, but that doesn’t look viable either: the decoding steps are preceded by the encoder run, and I’m not sure the encoder’s gradients will be valid.
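For what it’s worth, gradients from separate .backward() calls do accumulate into .grad, so the encoder’s contribution stays valid as long as the shared part of the graph is retained (a tiny illustration, not the seq2seq code itself; `enc` stands in for an encoder output reused by two decoder steps):

```python
import torch

x = torch.ones(3, requires_grad=True)
enc = x * 2                         # stands in for the encoder output
step1 = (enc * 3).sum()             # first "decoder step" loss
step2 = (enc * 4).sum()             # second "decoder step" loss

step1.backward(retain_graph=True)   # keep enc's graph for the next call
step2.backward()

# gradients from both calls are summed into x.grad: 3*2 + 4*2 = 14 each
print(x.grad)                       # tensor([14., 14., 14.])
```

Note that retain_graph=True keeps the retained portion’s activations in memory, so by itself this doesn’t reduce peak usage.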

Thanks!


(Hugh Perkins) #2

I recommend upgrading to PyTorch 0.2; it might fix your memory issue?


(Max Lapan) #3

I don’t think so. My issue doesn’t look like a bug; it’s more a consequence of the way PyTorch calculates gradients (tape-based autograd) and of how I’m using my RNN.


(Pierre Bartet) #4

Something I really don’t understand is whether RNNs are unrolled or not, and how to choose. In Keras with the Theano backend you can specify unroll=True or False, depending on the memory vs. speed tradeoff you want to make.

By the way in the specific case described here, whatever the rolling status of the LSTM, the memory consumption will grow due to the output you use:
net_out = nn.Linear(in_features=HIDDEN_SIZE, out_features=INPUT_SIZE)

Indeed:
print(net_out)
gives:
Linear (512 -> INPUT_SIZE)


(Max Lapan) #5

As far as I understand the whole machinery, PyTorch has a dynamic graph, which lets you decide how many steps to unroll your RNN. With simple architectures, this allows you to avoid stepping over time altogether (which you have to do in Keras) and process a whole batch of variable-length sequences with one RNN call, preceded by pack_padded_sequence.
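The single-call pattern looks like this (a small sketch with made-up sizes; lengths must be sorted in decreasing order):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# batch of 3 sequences with true lengths 5, 3, 2, zero-padded to length 5
batch = torch.zeros(3, 5, 8)
lengths = [5, 3, 2]

packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, (h, c) = rnn(packed)        # one RNN call for the whole batch
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)                        # torch.Size([3, 5, 16])
```

This works for the encoder, but not for a decoder whose input at step t depends on its own output at step t−1.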

Unfortunately, it doesn’t work as smoothly in the case of a seq2seq architecture.


#6

I had the same problem with an encoder–decoder model used for summarization: encoder sequences of ~400 and decoder sequences of ~120 max length for my 12GB TitanX. Also, this was on v0.2.

You could try using a different optimizer… one that doesn’t track gradients…


(Max Lapan) #7

Hm, that’s an interesting suggestion. Could you give more information about such optimizers? I thought all of them just use the gradients gathered from the computation graph built by Variable operations.

Today I had a different idea: split both input and output sequences into chunks of fixed size and update gradients at the end of each chunk. It should be similar to “RNN unrolling” in Keras or TF; in theory it can have a negative effect on convergence, but it looks like a solution to the memory limitation. I haven’t tried it yet, though.
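The chunking idea (truncated backpropagation through time) can be sketched as follows; the names and sizes are illustrative, and the key point is detaching the hidden state at chunk boundaries so each chunk’s graph can be freed right after its backward pass:

```python
import torch
import torch.nn as nn

CHUNK = 10
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)

seq = torch.randn(4, 100, 8)           # batch of 4 sequences, length 100
hidden = None
opt.zero_grad()

for start in range(0, seq.size(1), CHUNK):
    out, hidden = rnn(seq[:, start:start + CHUNK], hidden)
    loss = out.pow(2).mean()           # dummy per-chunk loss
    loss.backward()                    # frees this chunk's graph
    # cut the graph: gradients will not flow across the chunk boundary,
    # but the hidden state values still carry over
    hidden = tuple(h.detach() for h in hidden)

opt.step()                             # gradients from all chunks accumulated
```

Peak memory is then bounded by CHUNK rather than the full sequence length, at the cost of no gradient flow across chunk boundaries.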


(Pierre Bartet) #8

I tried your example without the enormous
net_out = nn.Linear(in_features=HIDDEN_SIZE, out_features=INPUT_SIZE)
layer, and it is not so terrible.

If you want a Linear layer after the LSTM, you can repeat a single nn.Linear(in_features=HIDDEN_SIZE, out_features=1) INPUT_SIZE times.

It doesn’t change our conceptual problem, but you can build much longer seq2seq architectures this way.


(Max Lapan) #9

That’s something I don’t understand. Could you please explain?

My understanding is that net_out is not too large (512*10 weights) compared to the LSTM unit itself, which has 4*(512*10 + 512^2) weights, i.e. about 200 times more parameters. Additionally, in this mode of seq2seq, on every step of the decoder RNN I need to apply the linear layer to calculate the next token, which is fed in as input on the next step.
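The arithmetic can be checked directly (a quick sketch; nn.LSTM also carries bias vectors on top of the 4*(512*10 + 512^2) weight count):

```python
import torch.nn as nn

HIDDEN, INP = 512, 10
lstm = nn.LSTM(input_size=INP, hidden_size=HIDDEN)
linear = nn.Linear(HIDDEN, INP)

lstm_params = sum(p.numel() for p in lstm.parameters())
lin_params = sum(p.numel() for p in linear.parameters())

# weights alone: 4 * (512*10 + 512^2) = 1069056 for the LSTM vs 512*10 = 5120
print(lstm_params, lin_params, lstm_params // lin_params)
```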

Of course, in real-world applications, where the set of output tokens is a full vocabulary (hundreds of thousands or even millions of words), the output layer will dominate. But that is a completely different story, which can be solved by hierarchical or sampled softmax.


(Pierre Bartet) #10

My bad, I got confused between INPUT_SEQ_LEN and INPUT_SIZE :frowning:
By the way, with no GPU I got:

  • input_seq_len=500 -> 2.8GB


(Max Lapan) #11

Hm, that’s interesting. Maybe it’s some cuDNN issue or inefficiency.


(Pierre Bartet) #12

After some tests, I don’t think there is any solution. I was afraid the weights were duplicated along the sequence, leading to 512^2 * INPUT_SEQ_LEN memory usage, but weight sharing seems to be correctly implemented, so that is not the case.
Still, when backpropagating we have to store the forward pass, so whatever we do there will be 512 * INPUT_SEQ_LEN memory consumption for the activations. The only solution is indeed to limit backpropagation through time to a limited number of steps.