Mitigating CUDA GPU memory fragmentation and OOM issues

I’m currently experimenting with transformers that use variable batch sizes, and I’m running into pretty severe memory fragmentation issues, with CUDA OOM occurring at less than 70% GPU memory utilization. For example (see the GitHub link below for more extreme cases of failure at <50% GPU memory):

RuntimeError: CUDA out of memory. Tried to allocate 1.48 GiB (GPU 0; 23.65 GiB total capacity; 16.22 GiB already allocated; 111.12 MiB free; 22.52 GiB reserved in total by PyTorch)

Note that in this example PyTorch has reserved 22.52 GiB but only 16.22 GiB is actually allocated; the ~6 GiB gap is cached by the allocator, yet it apparently can’t serve a contiguous 1.48 GiB request. This has been discussed before on the PyTorch forums [1, 2] and on GitHub. Fragmentation is also mentioned briefly in the docs for torch.cuda.empty_cache():

empty_cache() doesn’t increase the amount of GPU memory available for PyTorch. However, it may help reduce fragmentation of GPU memory in certain cases. See Memory management for more details about GPU memory management.

According to this thread, we shouldn’t be relying on torch.cuda.empty_cache(). However (in my use case, at least), torch.cuda.empty_cache() is the difference between running with GiBs of GPU memory to spare and hitting CUDA OOM errors. None of these past threads (as far as I can tell) really led to conclusive mitigation strategies for this problem.
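For concreteness, here is a minimal, self-contained sketch of the pattern I mean (toy model and random data, not my actual training code): call torch.cuda.empty_cache() between variable-sized batches and log allocated vs. reserved memory to watch the gap that fragmentation leaves behind.

import torch

def log_memory(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in for the real transformer
optimizer = torch.optim.Adam(model.parameters())

for step, batch_size in enumerate([256, 4096, 512, 8192]):  # variable batch sizes
    batch = torch.randn(batch_size, 4096, device="cuda")
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    del batch, loss
    # Without this call, my real runs OOM well below total capacity.
    torch.cuda.empty_cache()
    log_memory(f"step {step}")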

My questions are:

  1. Are there “official” best practices for handling cases when GPU memory fragmentation is severely reducing effective usable GPU memory?
  2. Can we get more documentation about when torch.cuda.empty_cache() is the right tool for addressing OOM/fragmentation issues?

As far as I can tell, moving all allocated tensors into main memory, emptying the cache, then moving them back to GPU memory à la this post seems like a reasonable (if potentially expensive) strategy. Is this recommended?
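For reference, here is a rough sketch of that idea (my own illustration of the linked post, not an official recipe), applied to a model and its optimizer state:

import torch

def defragment_cuda_memory(model: torch.nn.Module,
                           optimizer: torch.optim.Optimizer,
                           device: str = "cuda") -> None:
    """Offload everything to CPU, release the CUDA cache, then move it back."""
    model.to("cpu")
    # Optimizer state tensors (e.g. Adam's exp_avg/exp_avg_sq) also live on the GPU.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.cpu()
    # Cached blocks can only be returned to the driver once nothing references them.
    torch.cuda.empty_cache()
    model.to(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

The extra host-to-device copies make this expensive, so presumably it’s only worth doing when the gap between reserved and allocated memory gets large.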

Not to steal focus from this super important issue you raise, but I am curious how your code would fare with the DeepSpeed integration, since it manages temp memory allocations itself. In my few experiments it needs 3-5x less GPU RAM with its ZeRO algorithms. That is, if you’re using the HF trainer. If you don’t use the latter, you can see from the code how to activate DeepSpeed in your own trainer.

We are getting close to completing the integration but you can already try my branch:

There is a doc in it that explains how to activate it.
You’d need to install DeepSpeed from its master.

If you have any follow up questions please ask in that PR thread so that we don’t derail this thread.

We also recently added support for Sharded DDP via fairscale. If you use the HF trainer, just install fairscale and add --sharded_ddp to the training args, and you should also see a huge improvement in memory utilization. Again, if you have questions, let’s continue that discussion on the HF forums.

Back to the topic of this thread.

Very interesting, thanks for sharing! I’m not using the HuggingFace trainer, but I went ahead and ran my current code with deepspeed.

I normally run:

ENV=var python -m train --arg1 val1 --arg2 val2

I ran:

ENV=var deepspeed ./train.py --arg1 val1 --arg2 val2

I don’t see any difference in runtime/memory usage, though. Is there anything else I need to do? My ds_report looks like this:

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------

Did you pre-build the extension ops? I’ve got a CUDA version mismatch that’s preventing me from pre-building them with DS_BUILD_OPS=1 pip install deepspeed.

Based on previous experience, continuing the ZeRO solution discussion here will derail your attempt to find core solutions. I highly recommend removing these follow-ups and opening a separate thread on the HF forums (and tagging @stas), and we can continue there.

But the quick answer, based on your reply, is that you’re not using DeepSpeed, you’re just using its launcher. You need to:

  1. create a configuration file which sets up its ZeRO magic (see DeepSpeed Configuration JSON - DeepSpeed)
  2. activate deepspeed (see Getting Started - DeepSpeed)
  3. use its wrapped model in training (see Getting Started - DeepSpeed)
  4. launch it as shown in Getting Started - DeepSpeed

That’s when you will see huge improvements.
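For anyone following along outside the HF trainer, here is a minimal, hedged sketch of what steps 1-3 can look like, loosely based on the DeepSpeed Getting Started pattern referenced above (the Linear model is a placeholder for your transformer; the only real flags are --local_rank and those added by deepspeed.add_config_arguments):

import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
# Step 2: adds --deepspeed and --deepspeed_config to the parser.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = torch.nn.Linear(4096, 4096)  # placeholder for the real transformer

# Step 3: wrap the model. The ZeRO settings come from the JSON config (step 1)
# passed on the command line, e.g. {"train_batch_size": 8,
#   "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
#   "zero_optimization": {"stage": 2}}
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

# In the training loop, use the engine backward/step methods rather than the raw model.
batch = torch.randn(8, 4096).to(model_engine.device)
loss = model_engine(batch).mean()
model_engine.backward(loss)
model_engine.step()

Step 4 is then something like deepspeed ./train.py --deepspeed --deepspeed_config ds_config.json --arg1 val1, rather than only swapping the launcher as in your run above.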