The following message says that, given 23.69 GiB of total GPU capacity, allocating anything beyond roughly 610 MiB already triggers an OOM error. The log from nvidia-smi further below, on the other hand, shows the real situation.
PyTorch message from a 24 GB GPU:
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.69
GiB total capacity; 595.94 MiB already allocated; 2.06 MiB free; 610.00 MiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
A similar message from a 12 GB GPU:
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.75 GiB total capacity; 810.93 MiB already
allocated; 3.62 MiB free; 828.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
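The discrepancy can be cross-checked from inside the process. Below is a minimal sketch (the tensor size and device index are arbitrary choices for illustration): torch.cuda.memory_allocated() and torch.cuda.memory_reserved() report only what PyTorch's own caching allocator holds for this process, while torch.cuda.mem_get_info() returns the device-wide free and total bytes, which appears to be where the "free" and "total capacity" figures in the message come from:

import torch

device = torch.device("cuda:0")
buf = torch.empty(100 * 2**20, dtype=torch.uint8, device=device)  # ~100 MiB tensor

# Memory held by PyTorch's caching allocator in this process only:
print(f"allocated by PyTorch: {torch.cuda.memory_allocated(device) / 2**20:.2f} MiB")
print(f"reserved by PyTorch:  {torch.cuda.memory_reserved(device) / 2**20:.2f} MiB")

# Device-wide view (free/total bytes, i.e. what nvidia-smi also sees):
free, total = torch.cuda.mem_get_info(device)
print(f"device free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB total")

As for the max_split_size_mb suggestion in the message, the variable has to be set before CUDA is initialized, e.g. export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the shell that launches the script (128 is just an example value).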
Logging with nvidia-smi shows that a memory issue did indeed occur: the framebuffer usage (fb column, in MB) jumps from a few hundred MB to almost 11 GB, close to the 10.75 GiB capacity reported above:
#Date Time gpu fb bar1 sm mem enc dec
#YYYYMMDD HH:MM:SS Idx MB MB % % % %
20230511 12:28:15 0 447 6 0 0 0 0
20230511 12:28:17 0 477 6 0 0 0 0
20230511 12:28:19 0 507 6 0 0 0 0
20230511 12:28:21 0 555 6 0 0 0 0
20230511 12:28:23 0 625 6 0 0 0 0
20230511 12:28:25 0 861 6 0 0 0 0
20230511 12:28:27 0 9961 6 2 0 0 0
20230511 12:28:29 0 9961 6 0 0 0 0
20230511 12:28:31 0 9961 6 0 0 0 0
20230511 12:28:33 0 9961 6 0 0 0 0
20230511 12:28:35 0 9961 6 0 0 0 0
20230511 12:28:37 0 9961 6 0 0 0 0
20230511 12:28:39 0 9961 6 0 0 0 0
20230511 12:28:41 0 9961 6 0 0 0 0
20230511 12:28:43 0 9961 6 0 0 0 0
20230511 12:28:45 0 9961 6 0 0 0 0
20230511 12:28:47 0 9961 6 0 0 0 0
20230511 12:28:49 0 9961 6 0 0 0 0
20230511 12:28:51 0 9961 6 0 0 0 0
20230511 12:28:53 0 9961 6 0 0 0 0
20230511 12:28:55 0 9961 6 0 0 0 0
20230511 12:28:57 0 9961 6 0 0 0 0
20230511 12:28:59 0 9961 6 0 0 0 0
20230511 12:29:01 0 9961 6 0 0 0 0
20230511 12:29:03 0 9961 6 0 0 0 0
20230511 12:29:05 0 9961 6 0 0 0 0
20230511 12:29:07 0 10541 6 3 0 0 0
20230511 12:29:09 0 10673 6 0 0 0 0
20230511 12:29:11 0 10771 6 0 0 0 0
20230511 12:29:13 0 10991 6 1 0 0 0
20230511 12:29:15 0 10991 6 0 0 0 0
20230511 12:29:17 0 10649 6 1 0 0 0
20230511 12:29:19 0 10649 6 0 0 0 0