torch.cuda.OutOfMemoryError: CUDA out of memory

yrh · January 10, 2024, 8:15am

I am running PyTorch in Docker (image tag: 2.1.2-cuda11.8-cudnn8-devel).
I tried to run the training script from xg-chu/CrowdDet on GitHub and got the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB. GPU 0 has a total capacty of 2.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 967.91 MiB is allocated by PyTorch, and 76.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.51       Driver Version: 511.69       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| N/A   49C    P0    N/A /  N/A |      0MiB /  2048MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        22      G   /Xwayland                       N/A      |
|    0   N/A  N/A        33      G   /Xwayland                       N/A      |
|    0   N/A  N/A        34      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+

I’m using an ASUS ZenBook laptop with an NVIDIA® GeForce® MX250 (2GB GDDR5). Is that why the GPU is capped at 2GiB?
The error says “Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use.” Is it normal for a process to use that much memory?
How do I tell how much GPU memory I need to train this model?

ptrblck · January 10, 2024, 3:46pm

Yes, the specs of your GPU state the memory it ships with, which is what applications can use. 2GB is quite small by today's standards, and the model seems to require more memory.
No, that is not normal; it seems to be a misleading or buggy error message.
The authors might have mentioned the requirements in their repository. If not, you could run the model on the CPU and check how much RAM the process takes. This is of course not an exact measurement, as the host would also load libraries etc., but it could give you a starting point.
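
A minimal sketch of the CPU-side check, assuming a Unix host (the `resource` module is not available on Windows). The helper names `peak_rss_mib` and `run_and_report` are made up for illustration, and the commented-out `train_one_step` stands in for whatever one training iteration looks like in your script; none of them come from CrowdDet or PyTorch.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(step, *args, **kwargs):
    """Run `step` once and return (result, peak MiB before, peak MiB after)."""
    before = peak_rss_mib()
    result = step(*args, **kwargs)
    after = peak_rss_mib()
    return result, before, after


# Hypothetical usage: move the model and batch to the CPU, then wrap one iteration.
# _, before, after = run_and_report(train_one_step, model, batch)
# print(f"peak RSS grew from {before:.0f} MiB to {after:.0f} MiB")
```

The difference between the two readings only approximates what the step needs, since the peak also includes the Python interpreter and every library loaded before the call.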

tnn · February 11, 2024, 11:06am

Hi,

Any idea why the weird “this process has 17179869184.00 GiB memory in use.” error message pops up? I’m also getting this message for a pretty small model that can easily fit into my GPU. I understand the usual error messages when the memory requirement is larger than what the GPU can support, but the memory usage reported in this message is extremely weird.

Thanks

ptrblck · February 11, 2024, 5:18pm

No, but could you post a minimal and executable code snippet reproducing this message so that we can debug it, please?
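
One observation that may help while narrowing this down: the reported figure is exactly what an all-ones 64-bit value looks like when printed in GiB, since 2^64 bytes is 2^34 GiB. That would be consistent with a "not available" sentinel or an unsigned wraparound being formatted as a size, and the nvidia-smi output in the first post does list per-process GPU memory as N/A. This is only a reading of the number, not a confirmed diagnosis. The helper `format_gib` below is a made-up name for illustration, not a PyTorch function.

```python
GIB = 2**30  # bytes per GiB


def format_gib(num_bytes: int) -> str:
    """Format a byte count as GiB with two decimals, like the error message does."""
    return f"{num_bytes / GIB:.2f} GiB"


# Largest value an unsigned 64-bit integer can hold (all bits set).
max_u64 = 2**64 - 1

# Compare this with the figure quoted in the error message.
print(format_gib(max_u64))
```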