Hi,
I have a customized GCN-based network and a pretty large graph (40000 x 40000). The forward pass works fine (i.e. GPU memory is sufficient), but CUDA runs out of memory when loss.backward() is executed. The error message is as follows:
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    main()
  File "main.py", line 125, in main
    r_gcn.train()
  File "/afs/crc.nd.edu/group/dmsquare/vol3/bni/user/e2e/model/R_GCN.py", line 167, in train
    loss.backward()
  File "/afs/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/afs/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 7.72 GiB (GPU 0; 31.75 GiB total capacity; 27.83 GiB already allocated; 2.90 GiB free; 27.86 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1591914858187/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x2b9c1d781b5e in /afs/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f39d (0x2b9c1d54339d in /afs/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2058e (0x2b9c1d54458e in /afs/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x291 (0x2b9bf50eb401 in /afs/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc454b (0x2b9bf339e54b in /afs/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xe0de37 (0x2b9bf33e7e37 in /afs/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
........
frame #24: <unknown function> + 0x7ea5 (0x2b9bca00aea5 in /lib64/libpthread.so.0)
frame #25: clone + 0x6d (0x2b9bca31d8cd in /lib64/libc.so.6)
The failed allocation is roughly 7 GiB, which is about the size of the graph itself (40000 x 40000 float32 is about 6.4 GB), and I need the gradient with respect to the graph entries. So I suppose the backward pass needs to allocate another matrix of the same size to store the computed gradient. However, I am not sure what the other parts of the error output mean.
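To illustrate what I mean, here is a toy sketch of my situation (a much smaller N, and a dense random tensor standing in for my actual graph and loss): autograd populates a .grad buffer with the same shape as the graph tensor, so backward needs a second graph-sized allocation on top of the forward activations.

```python
import torch

# Toy stand-in for the real setup: the actual graph is 40000 x 40000
# (~6.4 GB in float32), which matches the ~7 GiB failed allocation.
N = 500
adj = torch.randn(N, N, requires_grad=True)  # "graph" entries we need gradients for

loss = (adj @ adj).sum()  # placeholder loss; forward also keeps intermediates alive
loss.backward()           # autograd allocates adj.grad: another N x N tensor

# The gradient buffer is the same size as the graph itself.
assert adj.grad is not None
assert adj.grad.shape == adj.shape
```

So even if the forward pass just fits in memory, backward adds (at least) one more graph-sized tensor, which seems consistent with the allocator report above.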
Thank you for your help!