Confused why GPU total capacity is bigger than PyTorch reserved memory (23.69 GiB vs 20.79 GiB)

```
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.47 GiB (GPU 3; 23.69 GiB total capacity; 19.38 GiB already allocated; 1.44 GiB free; 20.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Dear all,
I am confused why GPU 3's total capacity is 23.69 GiB while PyTorch has only reserved 20.79 GiB.
I am sure that no other process is running on the GPU.
I tried setting os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024".
However, it increased the computation time by nearly 30%. As the error message shows, the required
memory is only a little more than the free memory.
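For reference, this is roughly what I did (a minimal sketch; as far as I understand, PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation for it to take effect):

```python
import os
# The allocator settings are read when the CUDA caching allocator initializes,
# so set the env var before any CUDA tensor is created (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"

import torch

device = torch.device("cuda:3")                   # GPU index from the error above
x = torch.randn(4096, 4096, device=device)        # first allocation initializes CUDA
print(torch.cuda.memory_summary(device=device))   # per-device allocator statistics
```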

Besides the memory allocations PyTorch owns, your GPU will also use memory for the CUDA context: the driver, each loaded CUDA kernel, etc. The actual overhead differs depending on the GPU, the CUDA version, whether CUDA's lazy module loading is enabled (it is enabled for all PyTorch binaries shipping with CUDA >= 11.7), and the loaded libraries.
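If you want to see how large this overhead is on your setup, you could compare the driver-reported numbers with the caching allocator's counters, e.g. with a minimal sketch like this (the device index is just an example):

```python
import torch

device = torch.device("cuda:3")            # example index; use the device you care about
x = torch.randn(1, device=device)          # forces context creation and kernel loading

free, total = torch.cuda.mem_get_info(device)    # driver view: free/total bytes on the device
reserved = torch.cuda.memory_reserved(device)    # held by PyTorch's caching allocator
allocated = torch.cuda.memory_allocated(device)  # actually used by live tensors

gib = 1024 ** 3
print(f"total           : {total / gib:.2f} GiB")
print(f"free            : {free / gib:.2f} GiB")
print(f"reserved        : {reserved / gib:.2f} GiB")
print(f"allocated       : {allocated / gib:.2f} GiB")
# total - free - reserved ~= memory used outside the caching allocator
# (CUDA context, driver, loaded kernels, and any other processes)
print(f"outside PyTorch : {(total - free - reserved) / gib:.2f} GiB")
```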
