Tried to allocate 784.00 MiB (GPU 0; 23.99 GiB total capacity; 7.15 GiB already allocated; 13.69 GiB free; 7.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation

I have come across several posts discussing the same error, but in all of them the free memory was less than what PyTorch was trying to allocate.

In my case, the reported free memory is more than 13 GiB, while PyTorch is only trying to allocate 784 MiB.

I tried lowering the batch size from 256 to 128, but received the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 23.99 GiB total capacity; 10.11 GiB already allocated; 11.45 GiB free; 10.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
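The hint at the end of the message refers to the caching allocator's `PYTORCH_CUDA_ALLOC_CONF` setting. A minimal sketch of how it can be set from Python, before any CUDA allocation happens (the `512` threshold here is an arbitrary starting value I picked for illustration, not something from the error message):

```python
import os

# Must be set before the first CUDA tensor is allocated (safest: before
# importing torch). Cached blocks larger than this threshold will not be
# split by the allocator, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same thing can be done by exporting the environment variable in the shell before launching the training script.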

I don’t seem to understand the main issue here.

UPDATE: Decreasing the number of workers from 8 to 1 solved the issue. I later increased it to 4, and it still worked.

I then increased it to 8, which worked this time.

It failed again in a later run, and I had to decrease the number of workers again.
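For reference, `num_workers` is the `DataLoader` argument being changed above. Each worker is a subprocess that prefetches batches ahead of the training loop, so higher worker counts keep more batches alive at once and raise peak memory use, which may explain why the same setting works on some runs and fails on others. A minimal sketch with a stand-in dataset (the tensor shapes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset just to make the sketch runnable; substitute your real
# Dataset. 512 samples of shape (3, 32, 32) with integer labels.
dataset = TensorDataset(torch.zeros(512, 3, 32, 32),
                        torch.zeros(512, dtype=torch.long))

# num_workers=0 loads batches in the main process; raising it spawns that
# many prefetching subprocesses, each holding batches in flight, so
# lowering it (as in the update above) reduces memory pressure.
loader = DataLoader(dataset, batch_size=128, num_workers=0)

n_batches = sum(1 for _ in loader)
print(n_batches)  # 512 samples / batch size 128 = 4 batches
```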