Misunderstanding CUDA out of memory

Hi,

I have a question about the CUDA out-of-memory error. I already know how to work around it; I just want to understand what the message means.

GPU: RTX 2080Ti, CUDA 10.1
PyTorch version: 1.6.0+cu101
Model: EfficientDet-D4

With a batch size of 1, training took 9.5 GiB of GPU memory. When I increased the batch size, I got the following errors:

# Batch_size = 2
CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 11.00 GiB total capacity; 8.32 GiB already allocated; 2.59 MiB free; 8.37 GiB reserved in total by PyTorch) 

# Batch_size = 3
CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.00 GiB total capacity; 8.23 GiB already allocated; 48.59 MiB free; 8.32 GiB reserved in total by PyTorch) 

# Batch_size = 4
CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.00 GiB total capacity; 8.03 GiB already allocated; 90.59 MiB free; 8.28 GiB reserved in total by PyTorch) 

# Batch_size = 5
CUDA out of memory. Tried to allocate 240.00 MiB (GPU 0; 11.00 GiB total capacity; 8.06 GiB already allocated; 38.59 MiB free; 8.33 GiB reserved in total by PyTorch) 

# Batch_size = 10
CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 11.00 GiB total capacity; 7.19 GiB already allocated; 964.59 MiB free; 7.43 GiB reserved in total by PyTorch) 

I am confused about how the allocated memory is measured. Why does the "already allocated" memory tend to decrease as I increase the batch size, and what does "reserved" memory mean in that message?

I read the code and comments in the PyTorch GitHub repository (lines 247-272). The comments mention cached memory, so what is it in my case?

Note: after killing all processes and freeing all tasks, I saw that the driver alone still occupies about 0.4 to 0.5 GB out of 11 GB on my GPU.


PyTorch tries to allocate the memory for a complete tensor at once, so increasing the batch size also increases the size of (some) tensors, and the memory blocks they need get bigger. When you run out of memory, the allocation that fails may therefore be larger (as seen in the “Tried to allocate …” message), while the memory already allocated at that point is smaller.
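To make this visible, here is a minimal sketch that pulls the numbers out of the error messages posted above. The regular expression and the helper name `parse_oom` are my own assumptions for illustration, not part of PyTorch, and the pattern only targets the message format shown in this thread.

```python
import re

# Error messages copied from the original post, keyed by batch size.
MESSAGES = {
    2: "CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 11.00 GiB total capacity; 8.32 GiB already allocated; 2.59 MiB free; 8.37 GiB reserved in total by PyTorch)",
    3: "CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.00 GiB total capacity; 8.23 GiB already allocated; 48.59 MiB free; 8.32 GiB reserved in total by PyTorch)",
    4: "CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.00 GiB total capacity; 8.03 GiB already allocated; 90.59 MiB free; 8.28 GiB reserved in total by PyTorch)",
    5: "CUDA out of memory. Tried to allocate 240.00 MiB (GPU 0; 11.00 GiB total capacity; 8.06 GiB already allocated; 38.59 MiB free; 8.33 GiB reserved in total by PyTorch)",
    10: "CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 11.00 GiB total capacity; 7.19 GiB already allocated; 964.59 MiB free; 7.43 GiB reserved in total by PyTorch)",
}

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical pattern matching only the message layout seen in this thread.
PATTERN = re.compile(
    r"Tried to allocate (?P<tried>[\d.]+) (?P<tried_u>[MG]iB).*?"
    r"(?P<alloc>[\d.]+) (?P<alloc_u>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_u>[MG]iB) free"
)


def parse_oom(message):
    """Return the requested, allocated and free sizes of one message, in MiB."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("unrecognized out-of-memory message")
    return {
        key: float(match[key]) * UNIT_TO_MIB[match[key + "_u"]]
        for key in ("tried", "alloc", "free")
    }


for batch_size in sorted(MESSAGES):
    sizes = parse_oom(MESSAGES[batch_size])
    print(
        f"batch {batch_size:>2}: tried {sizes['tried']:8.2f} MiB, "
        f"already allocated {sizes['alloc']:8.2f} MiB"
    )
```

In these five messages the requested block grows with every batch size, while the allocated figure mostly shrinks: the larger the block, the earlier it fails to fit.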

Reserved memory is the sum of the allocated and the cached memory.
Cached memory is memory that PyTorch holds on to so it can reuse device memory without reallocating it.
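As a rough sketch of that relationship: `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` are the PyTorch calls that report the two counters, while the helper names `memory_breakdown` and `query_cuda_memory` are hypothetical and only there for illustration.

```python
def memory_breakdown(allocated, reserved):
    """Split reserved memory into allocated (live tensors) and cached (kept for reuse)."""
    return {
        "allocated": allocated,
        "cached": reserved - allocated,
        "reserved": reserved,
    }


def query_cuda_memory(device=0):
    """Read both counters (in bytes) from PyTorch; needs torch and a CUDA device."""
    import torch  # imported lazily so the arithmetic above also runs without a GPU

    return memory_breakdown(
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )


# Figures from the batch size = 2 message, in GiB instead of bytes.
example = memory_breakdown(allocated=8.32, reserved=8.37)
print(f"cached: {example['cached']:.2f} GiB")
```

Calling `query_cuda_memory()` inside the training loop should give the same kind of breakdown for the live process, in bytes.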

I am not familiar with how memory is managed or laid out on the hardware, so I cannot follow up on your first point (if you have any related documents, they would help me a lot).

    // The sum of "allocated" + "free" + "cached" may be less than the
    // total capacity due to memory held by the driver and usage by other
    // programs.

Following the comment in the PyTorch GitHub code (lines 257-259) that I mentioned above, together with your answer, I have a new question.

Should it be:

 total = allocated + free + cached + driver # driver is 0.4 GiB, I mentioned above
       = allocated + free + (reserved - allocated) + driver # follow your answer
       = free + reserved + driver

Let's take batch size = 2 as an example:

2.59 MiB + 8.37 GiB + 0.4 GiB = 8.7725 GiB 

So where is the rest of the memory, the remaining 2.2275 GiB (11.00 GiB - 8.7725 GiB)?
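The same bookkeeping as a short script, in case someone wants to repeat it for the other batch sizes. This is only a sketch: the values are the rounded figures from the batch size = 2 message, and the 0.4 GiB driver figure is my own estimate from the note above, not something PyTorch reports.

```python
MIB_PER_GIB = 1024

# Values taken from the batch size = 2 error message.
total_gib = 11.00
free_gib = 2.59 / MIB_PER_GIB  # 2.59 MiB converted to GiB
reserved_gib = 8.37            # allocated + cached

# Assumption: driver usage estimated by hand with no process running.
driver_gib = 0.4

accounted_gib = free_gib + reserved_gib + driver_gib
unaccounted_gib = total_gib - accounted_gib

print(f"accounted for: {accounted_gib:.4f} GiB")
print(f"unaccounted:   {unaccounted_gib:.4f} GiB")
```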
