CUDA out of memory when back-propagating loss


I have a customized GCN-based network and a pretty large graph (40000 x 40000). There is no problem with the forward pass (i.e. the GPU memory is sufficient), but CUDA runs out of memory when loss.backward() is executed. The error message is as follows:

Traceback (most recent call last):
  File "", line 142, in <module>
    main()
  File "", line 125, in main
    r_gcn.train()
  File "/afs/", line 167, in train
    loss.backward()
  File "/afs/lib/python3.8/site-packages/torch/", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/afs/lib/python3.8/site-packages/torch/autograd/", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 7.72 GiB (GPU 0; 31.75 GiB total capacity; 27.83 GiB already allocated; 2.90 GiB free; 27.86 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1591914858187/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x2b9c1d781b5e in /afs/lib/python3.8/site-packages/torch/lib/
frame #1: <unknown function> + 0x1f39d (0x2b9c1d54339d in /afs/lib/python3.8/site-packages/torch/lib/
frame #2: <unknown function> + 0x2058e (0x2b9c1d54458e in /afs/lib/python3.8/site-packages/torch/lib/
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x291 (0x2b9bf50eb401 in /afs/lib/python3.8/site-packages/torch/lib/
frame #4: <unknown function> + 0xdc454b (0x2b9bf339e54b in /afs/lib/python3.8/site-packages/torch/lib/
frame #5: <unknown function> + 0xe0de37 (0x2b9bf33e7e37 in /afs/lib/python3.8/site-packages/torch/lib/
........
frame #24: <unknown function> + 0x7ea5 (0x2b9bca00aea5 in /lib64/
frame #25: clone + 0x6d (0x2b9bca31d8cd in /lib64/

The error message reports a failed allocation of roughly 7 GiB of GPU memory, which is about the size of the graph, and I need the gradients of the graph entries. I therefore suppose the backward pass needs another matrix of the same size to store the computed gradients. However, I am not sure what the other error outputs mean.
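A quick back-of-envelope check supports this reading, assuming the graph is stored as a dense float32 tensor (the exact allocation size also depends on dtype and any intermediate buffers, so this is only an estimate):

```python
# Estimate the memory footprint of a dense 40000 x 40000 tensor.
# Assumes float32 entries (4 bytes each); the gradient of the graph
# would need roughly the same amount again.
n = 40_000
bytes_per_entry = 4  # float32
size_gib = n * n * bytes_per_entry / 2**30
print(f"{size_gib:.2f} GiB")  # ~5.96 GiB, in the same ballpark as the 7.72 GiB allocation
```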

Thank you for your help!

Your assumption sounds reasonable, since the first backward call will use additional memory to store the gradients.

You could reduce the batch size and rerun the training, or alternatively you could trade compute for memory via torch.utils.checkpoint.
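A minimal sketch of the checkpointing approach (the Block module here is a hypothetical stand-in for one of your GCN layers, not your actual model): checkpointed segments do not store their intermediate activations during the forward pass and instead recompute them during backward, which lowers peak memory at the cost of extra compute.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Hypothetical stand-in for one layer of the network."""
    def __init__(self, dim):
        super().__init__()
        self.lin1 = torch.nn.Linear(dim, dim)
        self.lin2 = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return self.lin2(torch.relu(self.lin1(x)))

block = Block(16)
x = torch.randn(8, 16, requires_grad=True)

# Wrap the forward call in checkpoint(): activations inside the block
# are recomputed during backward instead of being kept in memory.
out = checkpoint(block, x)
out.sum().backward()
print(x.grad.shape)  # gradients still flow through the checkpointed block
```

On recent PyTorch versions you may want to pass use_reentrant=False explicitly to checkpoint(), which is the recommended mode and also silences a deprecation warning.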

The malloc stack trace points again to the CUDACachingAllocator, so you can ignore this message, as the RuntimeError was already raised.

Thank you! This is very helpful.