Mitigating CUDA GPU memory fragmentation and OOM issues

I’m currently experimenting with transformers that use variable batch sizes, and I’m running into pretty severe memory fragmentation issues, with CUDA OOM occurring at less than 70% GPU memory utilization. For example (see the GitHub link below for more extreme cases of failure at <50% GPU memory):

RuntimeError: CUDA out of memory. Tried to allocate 1.48 GiB (GPU 0; 23.65 GiB total capacity; 16.22 GiB already allocated; 111.12 MiB free; 22.52 GiB reserved in total by PyTorch)

Note that in this example PyTorch has reserved 22.52 GiB but only 16.22 GiB is actually allocated; the ~6 GiB gap is cached by the allocator, yet it apparently can’t serve a contiguous 1.48 GiB request. This has been discussed before on the PyTorch forums [1, 2] and on GitHub. Fragmentation is also mentioned briefly in the docs for torch.cuda.empty_cache():

empty_cache() doesn’t increase the amount of GPU memory available for PyTorch. However, it may help reduce fragmentation of GPU memory in certain cases. See Memory management for more details about GPU memory management.

According to this thread, we shouldn’t be relying on torch.cuda.empty_cache(). However (in my use case, at least), torch.cuda.empty_cache() is the difference between running with GiBs of GPU memory to spare and hitting CUDA OOM errors. None of these past threads (as far as I can tell) really led to conclusive mitigation strategies for this problem.
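For concreteness, here is a minimal, self-contained sketch of the pattern I mean (toy model and random data, not my actual training code): call torch.cuda.empty_cache() between variable-sized batches and log allocated vs. reserved memory to watch the gap that fragmentation leaves behind.

import torch

def log_memory(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in for the real transformer
optimizer = torch.optim.Adam(model.parameters())

for step, batch_size in enumerate([256, 4096, 512, 8192]):  # variable batch sizes
    batch = torch.randn(batch_size, 4096, device="cuda")
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    del batch, loss
    # Without this call, my real runs OOM well below total capacity.
    torch.cuda.empty_cache()
    log_memory(f"step {step}")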

My questions are:

  1. Are there “official” best practices for handling cases when GPU memory fragmentation is severely reducing effective usable GPU memory?
  2. Can we get more documentation about when torch.cuda.empty_cache() is the right tool for addressing OOM/fragmentation issues?

As far as I can tell, moving all allocated tensors into main memory, emptying the cache, then moving them back to GPU memory à la this post seems like a reasonable (if potentially expensive) strategy. Is this recommended?
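For reference, here is a rough sketch of that idea (my own illustration of the linked post, not an official recipe), applied to a model and its optimizer state:

import torch

def defragment_cuda_memory(model: torch.nn.Module,
                           optimizer: torch.optim.Optimizer,
                           device: str = "cuda") -> None:
    """Offload everything to CPU, release the CUDA cache, then move it back."""
    model.to("cpu")
    # Optimizer state tensors (e.g. Adam's exp_avg/exp_avg_sq) also live on the GPU.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.cpu()
    # Cached blocks can only be returned to the driver once nothing references them.
    torch.cuda.empty_cache()
    model.to(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

The extra host-to-device copies make this expensive, so presumably it’s only worth doing when the gap between reserved and allocated memory gets large.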

Not to steal focus from this super important issue you raise, but I am curious how your code would fare with the DeepSpeed integration, since it manages temp memory allocations itself. In my few experiments it needs 3-5x less GPU RAM with its ZeRO algorithms. That is, if you’re using the HF trainer. If you don’t use the latter, you can see from the code how to activate DeepSpeed in your own trainer.

We are getting close to completing the integration but you can already try my branch:

There is a doc in it that explains how to activate it.
You’d need to install DeepSpeed from its master.

If you have any follow up questions please ask in that PR thread so that we don’t derail this thread.

We also recently added support for Sharded DDP via fairscale. If you use the HF trainer, just install fairscale and add --sharded_ddp to the training args, and you should also see a huge improvement in memory utilization. Again, if you have questions, let’s continue that discussion on the HF forums.

Back to the topic of this thread.

Very interesting, thanks for sharing! I’m not using the HuggingFace trainer, but I went ahead and ran my current code with deepspeed.

I normally run:

ENV=var python -m train --arg1 val1 --arg2 val2

I ran:

ENV=var deepspeed ./train.py --arg1 val1 --arg2 val2

I don’t see any difference in runtime/memory usage, though. Is there anything else I need to do? My ds_report looks like this:

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------

Did you pre-build the extension ops? I’ve got a CUDA version mismatch that’s preventing me from pre-building them with DS_BUILD_OPS=1 pip install deepspeed.

Based on previous experience, continuing the ZeRO solution discussion here will derail your attempt to find core solutions. I highly recommend removing these follow-ups and opening a separate thread on the HF forums (and tagging @stas), and we can continue there.

But the quick answer, based on your reply, is that you’re not using DeepSpeed, you’re just using its launcher. You need to:

  1. create a configuration file which sets up its ZeRO magic (see DeepSpeed Configuration JSON - DeepSpeed)
  2. activate deepspeed (see Getting Started - DeepSpeed)
  3. use its wrapped model in training (see Getting Started - DeepSpeed)
  4. launch it as shown in Getting Started - DeepSpeed

That’s when you will see huge improvements.
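For anyone following along outside the HF trainer, here is a minimal, hedged sketch of what steps 1-3 can look like, loosely based on the DeepSpeed Getting Started pattern referenced above (the Linear model is a placeholder for your transformer; the only real flags are --local_rank and those added by deepspeed.add_config_arguments):

import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
# Step 2: adds --deepspeed and --deepspeed_config to the parser.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = torch.nn.Linear(4096, 4096)  # placeholder for the real transformer

# Step 3: wrap the model. The ZeRO settings come from the JSON config (step 1)
# passed on the command line, e.g. {"train_batch_size": 8,
#   "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
#   "zero_optimization": {"stage": 2}}
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

# In the training loop, use the engine backward/step methods rather than the raw model.
batch = torch.randn(8, 4096).to(model_engine.device)
loss = model_engine(batch).mean()
model_engine.backward(loss)
model_engine.step()

Step 4 is then something like deepspeed ./train.py --deepspeed --deepspeed_config ds_config.json --arg1 val1, rather than only swapping the launcher as in your run above.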