CUDA allocator not able to use cached memory [solution]

Hi, this is very similar to this post: Unable to allocate cuda memory, when there is enough of cached memory, but I just wanted to check whether my proposed solution should work as a fix.

My error is:

RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 3; 15.78 GiB total capacity; 6.74 GiB already allocated; 792.19 MiB free; 13.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

As you can see, I’m trying to allocate 1.53 GiB while about 7 GiB is already allocated. Ideally the roughly 7 GiB that PyTorch has reserved but not allocated could be used for the 1.53 GiB request, but the allocator doesn’t seem to be able to reuse it. Likely this is because my application does a lot of GPU-CPU swaps, as it uses a memory offload system I’ve developed (similar to ZeRO).
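
For reference, this is the small diagnostic sketch I’m using to watch the gap between allocated and reserved memory (the device index is just the GPU from my error message, adapt as needed):

import torch

device = torch.device("cuda:3")  # GPU 3 from the error message above
allocated_gib = torch.cuda.memory_allocated(device) / 2**30
reserved_gib = torch.cuda.memory_reserved(device) / 2**30
print(f"allocated: {allocated_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")
# memory_summary() also breaks out "inactive split" blocks, i.e. the
# cached-but-unusable memory that fragmentation leaves behind
print(torch.cuda.memory_summary(device))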

I know that DeepSpeed handles memory management itself to avoid this issue, but I’m just looking for a quick fix. Would setting this variable:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

avoid fragmentation and thus resolve my issue? What are the implications of setting this variable?
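
For completeness, this is how I would set it; the allocator reads the environment variable when CUDA is initialized, so it has to be set before the first allocation (128 is just the example value, not a recommendation):

# in the shell, before launching the script:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# or equivalently from Python:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported after setting the variable, to be safe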

I would appreciate any help on this subject.

Tuning the caching allocator split size is kind of in the realm of black magic, so it’s not easy to predict what will happen other than running your code/model with a few settings and seeing what happens.


Can confirm it worked. Hopefully this post helps anyone else with the same issue.


I had the same issue; here’s my experience with the “black magic”, so the next person can build on it :slight_smile:
The error I got was

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.95 GiB (GPU 0; 23.69 GiB total capacity; 8.02 GiB already allocated; 4.94 GiB free; 18.09 GiB reserved in total by PyTorch)

So 18 GiB reserved, but only 8 GiB allocated. Looks like a textbook example of memory fragmentation. I set max_split_size_mb:128, which solved the OOM but made GPU utilization (as shown by nvidia-smi) much worse: where previously it was stable at 99-100%, it now fluctuated in the range 20-100%. Increasing max_split_size_mb to 1024 improved utilization, but not by much. Increasing it further to 4096 returned me to 100% utilization. I chose that value because, based on the error message, the trouble was with the block >4 GiB in size, so it made sense to stop the allocator from splitting blocks that large while still allowing smaller blocks to be split.
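
If it helps the next person tune this: the statistic I watched between runs was the allocator’s inactive split blocks. A rough sketch below; these keys exist in torch.cuda.memory_stats() on the PyTorch version I used, but check your own version.

import torch

stats = torch.cuda.memory_stats(0)  # GPU 0 in my case
# bytes sitting in blocks that were split off a larger cached block and are
# now idle; a large number here is the fragmentation the OOM message hints at
inactive_gib = stats["inactive_split_bytes.all.current"] / 2**30
print(f"inactive split blocks: {stats['inactive_split.all.current']}")
print(f"inactive split bytes:  {inactive_gib:.2f} GiB")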