I’m currently playing around with some transformers trained on variable batch sizes, and I’m running into pretty severe memory fragmentation issues, with CUDA OOM occurring at less than 70% GPU memory utilization. For example (see the GitHub link below for more extreme cases, with failures at <50% GPU memory):
RuntimeError: CUDA out of memory. Tried to allocate 1.48 GiB (GPU 0; 23.65 GiB total capacity; 16.22 GiB already allocated; 111.12 MiB free; 22.52 GiB reserved in total by PyTorch)
The documentation for torch.cuda.empty_cache() notes:

> empty_cache() doesn’t increase the amount of GPU memory available for PyTorch. However, it may help reduce fragmentation of GPU memory in certain cases. See Memory management for more details about GPU memory management.
According to this thread, we shouldn’t be relying on torch.cuda.empty_cache(). However (in my use case, at least), torch.cuda.empty_cache() is the difference between running my code with GiBs of GPU memory to spare and hitting CUDA OOM errors. None of these past threads (as far as I can tell) really led to conclusive mitigation strategies for this problem.
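For concreteness, this is roughly the pattern I mean. The names (model, optimizer, batches) are placeholders standing in for my actual setup; the point is just where empty_cache() gets called:

```python
import torch

def train_epoch(model, optimizer, batches, device="cuda"):
    # batches vary in size, so the allocation sizes requested each step vary too
    for batch in batches:
        batch = batch.to(device)
        loss = model(batch).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        del batch, loss
        # Without this call I hit OOM well below full utilization;
        # with it, the same run finishes with GiBs to spare.
        torch.cuda.empty_cache()
```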
My questions are:
- Are there “official” best practices for handling cases when GPU memory fragmentation is severely reducing effective usable GPU memory?
- Can we get more documentation about when torch.cuda.empty_cache() is the right tool for addressing OOM/fragmentation issues?
As far as I can tell, moving all allocated tensors into main memory, emptying the cache, then moving them back to GPU memory à la this post seems like a reasonable (if potentially expensive) strategy. Is this recommended?
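Concretely, I had something like the sketch below in mind. The function name defragment_gpu is made up, and walking gc.get_objects() to find live CUDA tensors is just my understanding of the approach from that post, not an official API:

```python
import gc
import torch

def defragment_gpu(device="cuda"):
    """Move every live CUDA tensor to host memory, drop the cached blocks,
    then move the tensors back. Expensive, but the idea is to let the
    allocator hand out contiguous blocks again (hypothetical helper)."""
    moved = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                moved.append(obj)
                obj.data = obj.data.cpu()
                if obj.grad is not None:
                    obj.grad.data = obj.grad.data.cpu()
        except Exception:
            continue  # some tracked objects raise on inspection; skip them

    # With no CUDA tensors holding blocks, this should release the cache.
    torch.cuda.empty_cache()

    for obj in moved:
        obj.data = obj.data.to(device)
        if obj.grad is not None:
            obj.grad.data = obj.grad.data.to(device)
```

I realize the host round-trip could be slow for large models, which is why I’m asking whether this is actually the recommended mitigation or just a workaround.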