Seq-to-Seq Encoder-Decoder Model with Reinforcement Learning - Debugging CUDA Memory Consumption

Hi,

Problem summary: I have implemented a seq-to-seq encoder-decoder model that uses reinforcement learning for training. I encounter the following error during training, exactly when attention is being computed, i.e. at
intermediate = vector.matmul(self._w_matrix).unsqueeze(1) + matrix.matmul(self._u_matrix):

```
RuntimeError: CUDA out of memory. Tried to allocate 944.00 MiB (GPU 0; 11.17 GiB total capacity; 9.86 GiB already allocated; 310.81 MiB free; 10.58 GiB reserved in total by PyTorch)
```

Model Details:

  1. The encoder model encodes an input sequence using PyTorch's LSTM cell (hidden_size=512).
  2. The decoder model's LSTM cell is initialised with the encoder model's output. At every time step t, the decoder's input is the concatenation of (a) the previous time step's prediction (t-1) and (b) the attended input produced by attending to the encoder outputs.
  3. Training objective: we beam-sample the top-k predictions from the decoder and generate a reward for each decoding. I back-propagate loss = log probability of each top-k decoding * its respective reward (REINFORCE trick).

The attention mechanism used to compute the weights is the following:
Attention is computed between a vector x (in our case h_t) and a matrix y (the encoder outputs) using an additive attention function. The function has two weight matrices W, U and a vector V; the similarity between the vector x and the matrix y is computed as V^T tanh(Wx + Uy).
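A minimal sketch of this additive attention, assuming the shapes described above (the module and the _v_vector name are just for illustration; _w_matrix and _u_matrix correspond to the matrices in the error line):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # score(x, y_j) = V^T tanh(W x + U y_j), as described above
    def __init__(self, decoder_dim, encoder_dim, attn_dim):
        super().__init__()
        self._w_matrix = nn.Parameter(torch.randn(decoder_dim, attn_dim) * 0.01)
        self._u_matrix = nn.Parameter(torch.randn(encoder_dim, attn_dim) * 0.01)
        self._v_vector = nn.Parameter(torch.randn(attn_dim) * 0.01)

    def forward(self, vector, matrix):
        # vector: (batch, decoder_dim)          -> current decoder state h_t
        # matrix: (batch, src_len, encoder_dim) -> encoder outputs
        intermediate = vector.matmul(self._w_matrix).unsqueeze(1) + matrix.matmul(self._u_matrix)
        scores = torch.tanh(intermediate).matmul(self._v_vector)       # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)                        # attention weights
        attended = torch.bmm(weights.unsqueeze(1), matrix).squeeze(1)  # (batch, encoder_dim)
        return attended, weights
```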

Parameters: I am sampling the top 64 decodings, with a batch size of 16, a hidden size of 512 for every LSTM cell (both encoder and decoder), and a max sequence length of 50 when decoding using beam search.

I know this might be confusing without the exact code. Let me know if any part of the code needs to be shared for more clarity. I want to debug what is consuming most of the memory, because the same error appears on a GPU with a larger memory of 32 GB (tried on a K80 and a V100).

I would recommend adding print(torch.cuda.memory_allocated()) statements to your code and checking each operation or layer sequentially.
This would give you an overview of where most of the memory is used.
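For example, a small helper like this between the operations would print the numbers directly in MB (the helper name and formatting are just for convenience):

```python
import torch

def log_mem(tag):
    # Print current/peak allocated and cached (reserved) memory in MB
    mb = 1024 ** 2
    print(f"{tag:<50} "
          f"allocated: {torch.cuda.memory_allocated() / mb:.0f}M, "
          f"max allocated: {torch.cuda.max_memory_allocated() / mb:.0f}M, "
          f"cached: {torch.cuda.memory_reserved() / mb:.0f}M, "
          f"max cached: {torch.cuda.max_memory_reserved() / mb:.0f}M")

# e.g. call it before/after each layer or operation:
# log_mem("-- attention | before")
# attended, weights = attention(h_t, encoder_outputs)
# log_mem("-- attention | after")
```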


Thanks, will debug and report if I am not able to resolve it.

Hi @ptrblck, I tried debugging by checking each operation sequentially. I figured out that the decoder is essentially the bottleneck: using the encoder outputs, I decode new predictions at each time step from the previous step's prediction and the attended input (using an LSTM cell). Saving the new predictions at each time step in a list also keeps their computation graphs alive, which is what keeps increasing the memory consumed.

What I want to understand is the following memory consumption during sequence decoding:

```
TRAINING ITERATION: 0
Decoding start
allocated: 58M, max allocated: 82M, cached: 102M, max cached: 102M
-- Before Beam Sampling
allocated: 58M, max allocated: 82M, cached: 102M, max cached: 102M
TimeStep_0
-- _prepare_output_projections | Start
allocated: 58M, max allocated: 82M, cached: 102M, max cached: 102M
-- _prepare_output_projections | New Decoder state computed
allocated: 60M, max allocated: 82M, cached: 102M, max cached: 102M
TimeStep_1
-- _prepare_output_projections | Start
allocated: 94M, max allocated: 119M, cached: 156M, max cached: 156M
-- _prepare_output_projections | New Decoder state computed
allocated: 162M, max allocated: 207M, cached: 264M, max cached: 264M
TimeStep_2
-- _prepare_output_projections | Start
allocated: 231M, max allocated: 256M, cached: 266M, max cached: 266M
-- _prepare_output_projections | New Decoder state computed
allocated: 299M, max allocated: 344M, cached: 374M, max cached: 374M


TimeStep_19
-- 1) _prepare_output_projections | Start
allocated: 2395M, max allocated: 2420M, cached: 2442M, max cached: 2442M
-- 4) _prepare_output_projections | New Decoder state computed
allocated: 2464M, max allocated: 2508M, cached: 2548M, max cached: 2548M
-- After Beam Sampling
allocated: 2529M, max allocated: 2530M, cached: 2550M, max cached: 2550M

TRAINING ITERATION: 1
Decoding start
allocated: 214M, max allocated: 2600M, cached: 2664M, max cached: 2664M
-- Before Beam Sampling
allocated: 214M, max allocated: 2600M, cached: 2664M, max cached: 2664M
TimeStep_0
-- _prepare_output_projections | Start
allocated: 214M, max allocated: 2600M, cached: 2664M, max cached: 2664M
-- _prepare_output_projections | New Decoder state computed
allocated: 218M, max allocated: 2600M, cached: 2664M, max cached: 2664M
TimeStep_1
-- _prepare_output_projections | Start
allocated: 307M, max allocated: 2600M, cached: 2786M, max cached: 2786M
-- _prepare_output_projections | New Decoder state computed
allocated: 429M, max allocated: 2600M, cached: 2850M, max cached: 2850M
TimeStep_2
-- _prepare_output_projections | Start
allocated: 605M, max allocated: 2600M, cached: 2974M, max cached: 2974M
-- _prepare_output_projections | New Decoder state computed
allocated: 727M, max allocated: 2600M, cached: 3036M, max cached: 3036M


TimeStep_33
-- 1) _prepare_output_projections | Start
allocated: 9839M, max allocated: 9899M, cached: 9958M, max cached: 9958M
-- 4) _prepare_output_projections | New Decoder state computed
allocated: 9962M, max allocated: 10006M, cached: 10074M, max cached: 10074M
-- After Beam Sampling
allocated: 10136M, max allocated: 10137M, cached: 10196M, max cached: 10196M
```

_prepare_output_projections is the function that computes the new decoder state (h_t, c_t) using the previous step's prediction and the attended input. The function also returns new predictions by projecting h_t, the output of the LSTM cell, into the vocabulary space.
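For context, the step has roughly this structure (a sketch only, not the exact code; the module arguments are placeholders):

```python
import torch

def decoder_step(last_predictions, h, c, encoder_outputs,
                 embedder, attention, decoder_cell, output_projection):
    # One decoding step for a "group" of batch_size * beam_size hypotheses.
    embedded = embedder(last_predictions)                     # (group, emb_dim)
    attended, _ = attention(h, encoder_outputs)               # (group, encoder_dim)
    decoder_input = torch.cat([embedded, attended], dim=-1)   # (group, emb_dim + encoder_dim)
    h, c = decoder_cell(decoder_input, (h, c))                # nn.LSTMCell step
    logits = output_projection(h)                             # (group, vocab_size), i.e. vocabulary space
    return logits, h, c
```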

At each time step, I am predicting for batch_size x beam_size sequences. That is, if batch_size = 8 and beam_size = 100, I am making 8 x 100 = 800 predictions at each time step. The stats above are shown for batch_size = 2 and beam_size = 32, with max_time_steps_allowed = 50.

I can see that the allocated memory increases with the number of predictions. Is something wrong, or is this the expected behaviour? I want to experiment with batch_size = 16 for stable training. Should I go with gradient accumulation? Please advise.

Do you need to call backward at some point and want Autograd to use all computation graphs?
I’m not familiar with your use case, but for RNNs you could detach the last state(s) so that the backward pass only calculates the gradients for the current step.
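The usual pattern looks roughly like this (only applicable if you don't need gradients to flow through the earlier steps):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=32, hidden_size=512)
h = torch.zeros(16, 512)
c = torch.zeros(16, 512)

for t in range(50):
    x_t = torch.randn(16, 32)        # stand-in for the real step input
    h, c = cell(x_t, (h, c))
    # detach so that the graph of step t+1 does not extend back through step t
    h, c = h.detach(), c.detach()
```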

I am calling backward on the computed reward, which is calculated in the following fashion:
For each training sample in the batch, I first have to decode n complete sequences (n = beam_size) and evaluate them with a metric to calculate the reward (or loss) for back-propagation.

For example, for each sample in the batch, I decode n sequences, so the output of the decoder has size (batch_size, n, max_seq_len). The metric can evaluate sequences only after they are complete (i.e. only after the end_token is sampled). The reward is computed for a total of batch_size * n sequences.

If using batch_size = 8 and beam_size = 32, I am sampling 8 * 32 sequences from the decoder, calculating their (1) log probability and (2) reward, and using the REINFORCE trick to compute the loss, which is reward * log_probability, before calling backward. Here is the loss:

loss = E_{(y_1, ..., y_T) ~ π}[ r(y_1, ..., y_T) ]

The loss is the expectation of the reward over decoded sequences, r(.) is the reward for a complete sequence, y_t are the outputs sampled from the decoder, and π is the decoder model (the policy).

In this training regime, I have no ground truth to match predictions against at each time step; I can only back-propagate after I have sampled a complete sequence. Thus saving the predictions for all sequences at all time steps also saves their computation graphs, which I think is what incrementally increases GPU memory consumption. Right?
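For concreteness, the loss I back-propagate has roughly this form (a sketch, not the exact code; log_probs and rewards are assumed to be collected per decoded sequence):

```python
import torch

def reinforce_loss(log_probs, rewards):
    # log_probs: (batch_size * beam_size,) sum of log pi(y_t | y_<t, x) over each sequence
    # rewards:   (batch_size * beam_size,) metric score of each completed sequence
    # REINFORCE: minimise -E[r(y) * log pi(y)]; rewards are treated as constants
    return -(rewards.detach() * log_probs).mean()
```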

Solutions tried: I have tried gradient accumulation, where I split the batch into sub-batches and call loss.backward() for each of them before calling optimiser.step().
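Roughly the standard pattern, something like this (a sketch; the model/helper names are placeholders):

```python
import torch

def train_step(batch, model, optimiser, num_sub_batches):
    # Gradient accumulation: decode fewer sequences per backward pass,
    # trading compute for memory. backward() frees each sub-batch's graph.
    optimiser.zero_grad()
    for sub_batch in torch.chunk(batch, num_sub_batches, dim=0):
        loss = model.reinforce_loss(sub_batch)      # decode + reward + log-prob (placeholder)
        (loss / num_sub_batches).backward()         # scale to match a full-batch update
    optimiser.step()
```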

Question: Is there anything that can be done to minimise such huge memory consumption? Let me know if you need a specific code excerpt to debug.

Paper for reference https://openreview.net/pdf?id=H1Xw62kRZ

Thanks for the update.

If your model does not depend on the batch size, e.g. via batchnorm layers, this should be a valid approach to trade compute for memory.
Did you still run out of memory using this approach?