OutOfMemoryError: CUDA out of memory despite available GPU memory

Hello PyTorch community,

I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

Here are the specifications of my setup and the model training:

  • GPU: NVIDIA GPU with 24 GB VRAM
  • Model: GPT-2, approximately 3 GB in size (roughly 800 million parameters at 32 bits each)
  • Training Data: 36,000 training examples with vector length of 600
  • Training Configuration: 5 epochs, batch size of 16, and fp16 enabled

These are my calculations:

  1. Model Size:
  • GPT-2 model: ~3 GB
  • Parameters: roughly 800 million parameters of 32 bits each
  2. Gradients:
  • Gradients are typically the same size as the model’s parameters.
  3. Batch Size and Training Examples:
  • Batch Size: 16
  • Training Examples: 36,000
  • Vector Length: 600
  4. Memory Allocation per Batch:
  • Model: 3 GB (unchanged per batch)
  • Gradients: 3 GB (unchanged per batch)
  • Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
  • Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch

Based on the above calculations, the memory allocation per batch for my scenario would be approximately:

  • Model: 3 GB
  • Gradients: 3 GB
  • Input and Output Data: 75 KB
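For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate above in plain Python. It assumes the parameter count is about 800 million (which is what a ~3 GB fp32 model implies), and the helper name `tensor_bytes` is made up for illustration.

```python
GIB = 1024 ** 3
KIB = 1024


def tensor_bytes(*shape, bytes_per_value=4):
    """Bytes needed for a dense tensor of the given shape (fp32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


# Assumption: ~800 million fp32 parameters, matching the ~3 GB model size.
num_params = 800_000_000
params = tensor_bytes(num_params)
grads = params  # one gradient value per parameter

# One batch of 16 sequences, 600 values each, stored as 32-bit values.
inputs = tensor_bytes(16, 600)
outputs = tensor_bytes(16, 600)

print(f"parameters: {params / GIB:.2f} GiB")
print(f"gradients:  {grads / GIB:.2f} GiB")
print(f"inputs + outputs: {(inputs + outputs) / KIB:.1f} KiB")
```

This only covers the tensors listed in the estimate; it does not account for anything else the training loop allocates.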

I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!
@albanD @ptrblck

In your calculation you are ignoring the intermediate forward activations, which need to be stored in order to compute the gradients. Depending on the model architecture, these activations could be larger than the model parameters.
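To give a feel for the scale, here is a rough sketch that counts just two kinds of activation: the attention score matrices (one per layer) and the output logits. The architecture numbers are assumptions, since the post does not say which GPT-2 variant is used; they are taken from the GPT-2 large configuration (36 layers, 20 heads, 50257-token vocabulary). The function names are made up for illustration, and this is a partial lower bound, not a full accounting.

```python
GIB = 1024 ** 3


def attention_score_bytes(batch, heads, seq_len, layers, bytes_per_value=2):
    """Bytes for one (batch, heads, seq_len, seq_len) score tensor per layer."""
    return batch * heads * seq_len * seq_len * bytes_per_value * layers


def logits_bytes(batch, seq_len, vocab, bytes_per_value=2):
    """Bytes for the (batch, seq_len, vocab) output logits."""
    return batch * seq_len * vocab * bytes_per_value


# From the post: batch size 16, sequence length 600, fp16 (2 bytes per value).
batch, seq_len = 16, 600
# Assumed (not stated in the post): GPT-2 large-sized configuration.
layers, heads, vocab = 36, 20, 50257

scores = attention_score_bytes(batch, heads, seq_len, layers)
logits = logits_bytes(batch, seq_len, vocab)

print(f"attention scores: {scores / GIB:.2f} GiB")
print(f"logits:           {logits / GIB:.2f} GiB")
```

Under these assumptions the attention scores alone come to roughly 7.7 GiB, already more than the parameters and gradients combined, and that is before counting the MLP activations, layer norm inputs, dropout masks, and so on. Implementations that use a fused attention kernel may not materialize the full score matrix, so treat the numbers as an illustration of why activations matter rather than a prediction for a specific setup.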
