`.backward()` doubles RAM usage


I’m having RAM-related issues with a new model I implemented.
So I checked how much memory was needed at each step: I ran the model and monitored the process’s maximum RAM usage, exiting at different steps.

I tried with both SGD and Adam.

  • Right after .forward(): 11GB
  • Right after computing loss: same
  • Right after .backward(): failed: Out Of memory, >32GB (including swap)
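
The measurement described above can be sketched as follows. This is a minimal sketch, assuming a Unix system where the standard-library `resource` module is available; `peak_rss_mb` and `log_stage` are hypothetical helper names, and the commented lines only suggest where the calls might go in a training step. It reports host RAM only, not GPU memory.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_stage(stage):
    """Print and return the peak RSS reached by the time `stage` finished."""
    peak = peak_rss_mb()
    print(f"{stage}: peak RSS {peak:.0f} MiB")
    return peak


# Hypothetical placement inside one training step:
#   output = model(batch);          log_stage("forward")
#   loss = criterion(output, tgt);  log_stage("loss")
#   loss.backward();                log_stage("backward")
after_start = log_stage("start")
```

Because the value is a running maximum, the stage at which the number jumps is the one that allocated the extra memory.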

Isn’t it weird that .backward() consumes that much RAM, i.e. more than the model itself?
My model itself may not be optimized with respect to memory, i.e. maybe I keep too much information around. It contains some non-parameter Variables (i.e. with requires_grad=False), but I don’t see how those things would impact the backward pass…

Any ideas?

What is the RAM consumption if you use .backward(retain_variables=True)?

Exactly the same.

I’m not sure how retain would change it, though.

You are right, if anything it takes more memory.

The intermediate gradients are allocated during backward pass, so it’s normal that the memory usage is roughly twice as much in backward as it is during forward pass.

In fact, I don’t see such a difference with other models like OpenNMT-py

I’m implementing my model as part of onmt so the train.py is the same.

The thing is that

  • my model has fewer parameters than default ONMT
  • my model uses more history variables (e.g. previous states h_t, c_t etc.)

Maybe the second point makes it more difficult to calculate the gradient?

For reference I’m working on the ML part of Paulus et al (2017)

The amount of memory required for backward depends linearly on the depth of the network.
PyTorch does some optimizations to reduce the memory requirements, but it still needs a bit less than 2x as much memory as computing only the forward pass (with volatile=False).
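
As a back-of-the-envelope illustration of the linear dependence on depth, the sketch below assumes a toy network in which every layer saves exactly one float32 activation of shape (batch, hidden) for the backward pass. Real layers save different amounts, so the function name `saved_activation_bytes` and the numbers are illustrative only.

```python
def saved_activation_bytes(depth, batch, hidden, bytes_per_value=4):
    """Bytes held for backward if each of `depth` layers saves one
    (batch, hidden) activation (a simplifying assumption)."""
    return depth * batch * hidden * bytes_per_value


shallow = saved_activation_bytes(depth=10, batch=64, hidden=512)
deep = saved_activation_bytes(depth=40, batch=64, hidden=512)
print(shallow / 2**20, "MiB vs", deep / 2**20, "MiB")
```

For an unrolled RNN, the number of time steps plays the role of depth in this estimate.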

I’m facing the same problem; actually it’s worse in my case.
Forward consumes 1.5GB of memory and backward uses 11GB. I’m using an LSTM encoder and decoder. A roughly 7x increase doesn’t seem normal.

Edit - The encoder/decoder sequence lengths are fairly substantial (140 for the encoder, 40 for the decoder). Could this be due to computing intermediate gradients at each step during backward, while forward doesn’t copy the weights, like a persistent RNN?

Edit 2 - I’ve run a few tests. Decreasing the decoder input length reduces memory consumption, so it’s likely that backward memory balloons because of gradient computation at each step. But the main problem I’m having is that extra memory is allocated on each backward call, so after a few training steps I get an OOM error.

I’ve made sure to delete the graph after each step by deleting the loss I backprop on. Not sure where the leak is coming from, really.

Not sure it’s a leak, maybe just that your network requires a lot of memory?
I’m not very familiar with RNNs in pytorch, so I’m afraid I can’t help you much more without code snippets

Maybe I am overlooking something obvious, but with retain_variables false, why can’t backpropagation overwrite the activations with the gradients so as not to require extra memory?

Our cases may be quite similar. Still, it is probably much more complicated than just a PyTorch error.
As @fmassa said, it makes sense that the memory required by the backward phase is closely tied to the architecture of your model, and not only to the number of parameters.

For example, my model has far fewer parameters than another one that runs in 8GB of RAM, while mine can’t. The reason, I think, is intra-decoder attention: at each step I compute the attention of h_t with all previous h_t’ (with t’<t). This links the gradient of the current step to every previous step. The same goes for other variables.
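
To make that growth concrete, the count below assumes one attention score per (t, t’) pair with t’<t, per batch element, and ignores everything else the model stores. `pairwise_score_count` is a hypothetical helper, not part of any library.

```python
def pairwise_score_count(steps):
    """Number of (t, t') pairs with t' < t over `steps` decoder steps."""
    # Step t (0-based) attends to the t states that came before it.
    return sum(range(steps))


for steps in (10, 20, 40):
    print(steps, "steps ->", pairwise_score_count(steps), "scores")
```

Doubling the decoder length roughly quadruples the number of scores, whereas quantities stored once per step only double.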

I’m not sure if I’m being clear, but after studying my code, the paper, and other papers with their respective implementations, it makes sense that the model I’m trying to implement requires more RAM. Maybe it’s the same for you.

Again, for reference, I’m talking about A Deep Reinforced Model for Abstractive Summarization.

If someone is interested in implementing it, we could collaborate.

Very interesting.
Funny enough I was trying to implement a different summarization model- https://nlp.stanford.edu/pubs/see2017get.pdf
I had to severely limit the sequence length or model size to run it at max memory usage.

There is no intra-decoder attention in this model, but memory usage is still pretty high. I’m convinced that the reason is storing gradients for all time steps, which is to be expected.

But I got it to work and am training it right now which takes a long time.

If anyone is interested - https://github.com/hashbangCoder/Text-Summarization

@DiffEverything in my case I found a super useful trick that allows the GPU RAM to stay quite constant (vs increasing per decoding time step previously).

Basically, I’m now running .backward() at each timestep. backward frees some buffers, which is what makes our model ‘lighter’, but it may also cause issues. If you run backward on a variable, each variable that was used to compute it will be freed. Thus, if you need one of these variables in the next iteration, it will fail, telling you that you already called backward. The trick is to explicitly copy the variables that will be needed in the next step. Doing this, you free everything but the variables you copied.

Quite hacky, but it makes sense and worked well for me. I couldn’t run an iteration with batch size 1 on my 8GB GPU; I can now run it with batch size 64 and I’m not close to OOM.
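
A minimal sketch of the per-timestep backward trick described above, written against current PyTorch. The post does not say how the copy is made, so using `detach()` for it is an assumption, and the LSTM cell, readout layer, and random data are toy stand-ins rather than the actual summarization model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a single LSTM cell and random data.
cell = nn.LSTMCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 1)
params = list(cell.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

inputs = torch.randn(20, 4, 8)   # (time, batch, features)
targets = torch.randn(20, 4, 1)

h = torch.zeros(4, 16)
c = torch.zeros(4, 16)

optimizer.zero_grad()
for x_t, y_t in zip(inputs, targets):
    h, c = cell(x_t, (h, c))
    loss_t = nn.functional.mse_loss(readout(h), y_t)
    # Backward at every step frees this step's graph buffers right away;
    # gradients accumulate in .grad across steps.
    loss_t.backward()
    # Carry the state forward without its history. This cuts the gradient
    # path to earlier steps, so the result is not full backpropagation
    # through time.
    h, c = h.detach(), c.detach()
optimizer.step()
```

The trade-off is that memory stays roughly constant in the sequence length, while gradients no longer flow through the carried state to earlier steps, so the parameter update differs from the one a single backward over the whole sequence would give.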