Pre-allocate memory in case of variable input length?

In the PyTorch tutorials, I found this article.

It says that if your dataset has inputs of variable length, pre-allocating memory with the maximum input length can help avoid OOM errors.
Pre-allocation of memory can be done with the following steps:

  1. generate a (usually random) batch of inputs with maximum sequence length (either corresponding to max length in the training dataset or to some predefined threshold)
  2. execute a forward and a backward pass with the generated batch, do not execute an optimizer or a learning rate scheduler, this step pre-allocates buffers of maximum size, which can be reused in subsequent training iterations
  3. zero out gradients
  4. proceed to regular training
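The steps above can be sketched roughly as follows. This is only a minimal illustration, not the tutorial's code: the `preallocate` helper, the toy model, and the shapes (`max_batch_size`, `max_seq_len`, `feature_dim`) are all assumptions made up for the example, and the script falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn


def preallocate(model, max_batch_size, max_seq_len, feature_dim, device):
    """Hypothetical helper: one forward/backward pass on a maximum-size random batch."""
    model.train()
    # Step 1: random batch with the maximum sequence length.
    dummy = torch.randn(max_batch_size, max_seq_len, feature_dim, device=device)
    # Step 2: forward and backward only; no optimizer or scheduler step.
    loss = model(dummy).sum()
    loss.backward()
    # Step 3: zero out the gradients so the warm-up pass does not affect training.
    model.zero_grad()


device = "cuda" if torch.cuda.is_available() else "cpu"
# Toy model standing in for the real network (an assumption for this sketch).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
preallocate(model, max_batch_size=8, max_seq_len=128, feature_dim=16, device=device)
# Step 4: proceed to regular training from here.
```

Since no optimizer step is taken, the model weights are left unchanged by the warm-up pass.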

My question is: should I do steps 1-3 at the beginning of every iteration, or only once at the beginning of the training code?

Also, is it related to this fragmentation problem?

You should do it once before the actual training starts, as the memory would be pre-allocated and moved to the cache afterwards. As long as you don’t clear the cache via torch.cuda.empty_cache() you wouldn’t have to rerun it.

Hi @ptrblck, I am just wondering: does the pre-allocation process mentioned above avoid GPU OOM errors in the middle of training? Thanks!

It could help if the OOM is triggered by memory fragmentation caused by allocating increasingly larger tensors, e.g. via:

import torch

for size in range(1, 100):  # each iteration requests a slightly larger tensor
    x = torch.randn(size, device='cuda')

as the larger tensors won’t be able to directly reuse the already allocated memory of the smaller one.
Reversing the order, i.e. starting with the largest tensor, would fix this.
That being said, I would claim it depends on your use case: why you are running out of memory, what is causing the fragmentation in the first place, and what exactly you would like to pre-allocate.
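To narrow down why an OOM occurs, it can help to compare the memory held by live tensors with the memory PyTorch has reserved in its cache. The sketch below is one possible way to do that; the `memory_report` helper and its dictionary keys are made up for this example, while `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, and `torch.cuda.memory_summary` are existing PyTorch calls. A large gap between reserved and allocated memory means much of the cache is currently unused, which may hint at fragmentation but does not prove it.

```python
import torch


def memory_report(device=None):
    """Hypothetical helper: allocated vs. reserved bytes on one GPU."""
    if not torch.cuda.is_available():
        # No GPU present: report zeros so the sketch still runs.
        return {"allocated": 0, "reserved": 0, "cached_unused": 0}
    allocated = torch.cuda.memory_allocated(device)  # bytes used by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    return {
        "allocated": allocated,
        "reserved": reserved,
        "cached_unused": reserved - allocated,
    }


report = memory_report()
print(report)
if torch.cuda.is_available():
    # Detailed per-pool breakdown from the caching allocator.
    print(torch.cuda.memory_summary(abbreviated=True))
```

Calling such a helper right before the failing operation, at a few different iterations, shows whether the reserved memory keeps growing while the allocated memory stays roughly constant.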

Thanks! It's really helpful!

Hello,

I get a (huge) OOM error when feeding inputs with a variable batch size to a BigGAN for inference only (under torch.no_grad).

The workaround that saved me for a while was to fix the batch size. However, I now need to run an experiment that again requires a variable input size, and the problem has reappeared.
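For readers wondering what fixing the batch size could look like when the number of requested samples varies, here is one possible sketch. It is an assumption and not necessarily what was done here: the `generate_in_chunks` helper, the `nn.Linear` stand-in for the generator, and the shapes are all hypothetical. The idea is to split the variable-size input into fixed-size chunks and pad the last one, so that every forward call sees the same shape.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def generate_in_chunks(generator, latents, chunk_size):
    """Hypothetical helper: run inference on fixed-size chunks of a variable-size batch."""
    outputs = []
    for chunk in latents.split(chunk_size):
        n = chunk.shape[0]
        if n < chunk_size:
            # Pad the final chunk with zeros so every call sees the same shape.
            pad = chunk.new_zeros((chunk_size - n, *chunk.shape[1:]))
            chunk = torch.cat([chunk, pad])
        outputs.append(generator(chunk)[:n])  # drop the outputs of padded rows
    return torch.cat(outputs)


generator = nn.Linear(8, 4).eval()  # stand-in for the real generator (assumption)
latents = torch.randn(10, 8)        # a "variable" batch of 10 samples
images = generate_in_chunks(generator, latents, chunk_size=4)
```

This relies on the model being in eval mode, so that the padded rows cannot influence the real ones through batch statistics.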

I tried (as I had already done in the past) to pre-allocate memory with the largest possible batch size, but this does not solve the problem.

To give you a bit more context:

  • I run BigGAN inference at every iteration to generate a variable number of images, which are then used to train another model (this model always has a fixed batch size).
  • The error always happens during conv2d in the BigGAN. See the error log below.
  • I am using DistributedDataParallel with multiple GPUs and multiple nodes.
  • I am not calling torch.cuda.empty_cache() at any point, and I am not using cudnn.benchmark either.
  • I pre-allocate memory by generating the maximum number of images (the whole batch) at iteration 0.
  • The error was obtained with PyTorch 1.8.0, torchvision 0.9.0, CUDA 10.2.

Here’s the error log:

BigGAN_PyTorch/layers.py", line 140, in forward
    return F.conv2d(
RuntimeError: CUDA out of memory. Tried to allocate 18.67 GiB (GPU 3; 31.75 GiB total capacity; 2.00 GiB already allocated; 18.67 GiB free; 11.48 GiB reserved in total by PyTorch)

Thanks