The following message says that, given 23.69 GiB of total GPU capacity, allocating anything beyond roughly 610 MiB already triggers an OOM error. The log from nvidia-smi further below, on the other hand, shows the real situation.
PyTorch message from a 24 GB GPU:
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.69
GiB total capacity; 595.94 MiB already allocated; 2.06 MiB free; 610.00 MiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
A similar message from a 12 GB GPU:
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.75 GiB total capacity; 810.93 MiB already
allocated; 3.62 MiB free; 828.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
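The discrepancy can be cross-checked from inside the process. Below is a minimal sketch (the tensor size and device index are arbitrary choices for illustration): torch.cuda.memory_allocated() and torch.cuda.memory_reserved() report only what PyTorch's own caching allocator holds for this process, while torch.cuda.mem_get_info() returns the device-wide free and total bytes, which appears to be where the "free" and "total capacity" figures in the message come from:

import torch

device = torch.device("cuda:0")
buf = torch.empty(100 * 2**20, dtype=torch.uint8, device=device)  # ~100 MiB tensor

# Memory held by PyTorch's caching allocator in this process only:
print(f"allocated by PyTorch: {torch.cuda.memory_allocated(device) / 2**20:.2f} MiB")
print(f"reserved by PyTorch:  {torch.cuda.memory_reserved(device) / 2**20:.2f} MiB")

# Device-wide view (free/total bytes, i.e. what nvidia-smi also sees):
free, total = torch.cuda.mem_get_info(device)
print(f"device free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB total")

As for the max_split_size_mb suggestion in the message, the variable has to be set before CUDA is initialized, e.g. export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the shell that launches the script (128 is just an example value).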
Logging with nvidia-smi shows that a memory issue did indeed occur: the framebuffer usage (fb column, in MB) jumps from a few hundred MB to almost 11 GB, close to the 10.75 GiB capacity reported above:
#Date Time gpu fb bar1 sm mem enc dec
#YYYYMMDD HH:MM:SS Idx MB MB % % % %
20230511 12:28:15 0 447 6 0 0 0 0
20230511 12:28:17 0 477 6 0 0 0 0
20230511 12:28:19 0 507 6 0 0 0 0
20230511 12:28:21 0 555 6 0 0 0 0
20230511 12:28:23 0 625 6 0 0 0 0
20230511 12:28:25 0 861 6 0 0 0 0
20230511 12:28:27 0 9961 6 2 0 0 0
20230511 12:28:29 0 9961 6 0 0 0 0
20230511 12:28:31 0 9961 6 0 0 0 0
20230511 12:28:33 0 9961 6 0 0 0 0
20230511 12:28:35 0 9961 6 0 0 0 0
20230511 12:28:37 0 9961 6 0 0 0 0
20230511 12:28:39 0 9961 6 0 0 0 0
20230511 12:28:41 0 9961 6 0 0 0 0
20230511 12:28:43 0 9961 6 0 0 0 0
20230511 12:28:45 0 9961 6 0 0 0 0
20230511 12:28:47 0 9961 6 0 0 0 0
20230511 12:28:49 0 9961 6 0 0 0 0
20230511 12:28:51 0 9961 6 0 0 0 0
20230511 12:28:53 0 9961 6 0 0 0 0
20230511 12:28:55 0 9961 6 0 0 0 0
20230511 12:28:57 0 9961 6 0 0 0 0
20230511 12:28:59 0 9961 6 0 0 0 0
20230511 12:29:01 0 9961 6 0 0 0 0
20230511 12:29:03 0 9961 6 0 0 0 0
20230511 12:29:05 0 9961 6 0 0 0 0
20230511 12:29:07 0 10541 6 3 0 0 0
20230511 12:29:09 0 10673 6 0 0 0 0
20230511 12:29:11 0 10771 6 0 0 0 0
20230511 12:29:13 0 10991 6 1 0 0 0
20230511 12:29:15 0 10991 6 0 0 0 0
20230511 12:29:17 0 10649 6 1 0 0 0
20230511 12:29:19 0 10649 6 0 0 0 0