Hi, I have a large network that does not fit in the memory of a single GPU, so I put different layers on different GPUs. This fixed the 'out of memory' error when loading the model, but the error still appears during backpropagation.
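For context, my split looks roughly like the sketch below (the layer sizes and device IDs here are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Model parallelism: first half on GPU 0, second half on GPU 1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move activations to the next device; autograd records this transfer,
        # so loss.backward() propagates gradients back across both GPUs.
        x = self.part2(x.to("cuda:1"))
        return x
```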
Traceback (most recent call last):
  File "/scratch/project_2005641/THz_DNN/THz_Huge.py", line 379, in <module>
    Loss_cache, Lr_list = train_model()
  File "/scratch/project_2005641/THz_DNN/THz_Huge.py", line 127, in train_model
    loss.backward()  # backpropagation
  File "/usr/local/lib64/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 2; 31.75 GiB total capacity; 16.01 GiB already allocated; 14.78 GiB free; 16.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
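The message suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. If I understand the docs correctly, that would look something like this (the 128 MB value is just a guess on my part, not a recommended setting):

```python
import os
# Must be set before the first CUDA allocation, ideally before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch
```

But I'm not sure fragmentation is the real problem here, since the single allocation it tries to make (16 GiB) is larger than the free memory on that GPU (14.78 GiB).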
Is there any way to fix this?