Keep getting CUDA OOM error with PyTorch failing to allocate all free memory

I encounter random OOM errors during model training. It looks like this:
RuntimeError: CUDA out of memory. Tried to allocate **8.60 GiB** (GPU 0; 23.70 GiB total capacity; 3.77 GiB already allocated; **8.60 GiB** free; 12.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

As you can see, PyTorch tried to allocate 8.60 GiB, exactly the amount of memory the exception report says is free, and failed.

This OOM error keeps popping up randomly during training, i.e., I can’t pin down the exact operation that causes it. Sometimes the last call in the trace is conv1d, sometimes it’s some backward(). The amount of memory “tried to allocate” also changes every time, but it always matches the “free” memory.

I’ve been training a one-shot NAS model using the single-path method, i.e. the model used for forward & backward differs from iteration to iteration. I suspect this adds further randomness to this strange error…

I’ve done some research on my own, like setting PYTORCH_CUDA_ALLOC_CONF according to the PyTorch docs and also setting PYTORCH_NO_CUDA_MEMORY_CACHING. These two environment variables both seem to mitigate the problem, which points to the PyTorch caching memory allocator as the culprit. Perhaps the allocator is trying to cache all the free memory and failing, I think?
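For reference, here is how these two allocator settings can be applied before launching a training run (128 is just an example value for max_split_size_mb, and train.py stands in for your own entry point):

```shell
# Cap the size of cached blocks the allocator may split; requests larger
# than this get their own dedicated allocation, which reduces fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Disable the caching allocator entirely (very slow; debugging only).
export PYTORCH_NO_CUDA_MEMORY_CACHING=1

# Then launch training as usual, e.g.:
# python train.py
```

Note that PYTORCH_NO_CUDA_MEMORY_CACHING forces a cudaMalloc/cudaFree on every allocation, which is why it is so much slower.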

But both of these env variables greatly reduce training speed. So I wonder whether someone has encountered a similar case and figured out how to solve it without hurting training efficiency. Thank you very much!


Since the report rounds memory to GiB, the allocation could still fail if the requested size is in fact slightly larger than the free memory, or if the memory is fragmented and no contiguous block of sufficient size can be found.

Thank you for your kind reply! Yes, fragmentation seems more plausible to me, since limiting block splitting (by setting PYTORCH_CUDA_ALLOC_CONF) helps mitigate this OOM problem.

I’m more curious about the fact that most of the OOM errors I see here are raised by trying to allocate all of the free memory and failing. The amount of memory it is “trying to allocate” varies every time, but it always matches the free memory (at least at GiB precision).
In most CUDA OOM errors I can find online, “tried to allocate” is bigger than “free”, and the error can be tracked down to a specific operation, like creating a large tensor.
That’s the unusual and interesting point in my case. I want to figure out whether this strange OOM is caused by some tensor creation or by a buggy allocator inside PyTorch.
FYI, my PyTorch version is 1.10.0, coming from the official NVIDIA Docker image.

The stacktrace should point to the operation that failed, so you could inspect it and check its memory requirements by printing the input shapes, etc.
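One way to correlate the failing operation with the allocator state is to log CUDA memory statistics around suspect calls. A minimal sketch (the tag strings are arbitrary labels):

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # Print allocator statistics so an OOM can be correlated with how much
    # memory was allocated vs. reserved just before the failing operation.
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    gib = 2 ** 30
    print(
        f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
        f"max_allocated={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

# Example: bracket a suspect operation with before/after calls.
log_cuda_memory("before conv1d")
```

`torch.cuda.memory_summary()` gives an even more detailed breakdown, including the allocator's fragmentation-relevant stats.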

Hi,
I have been getting CUDA OOM errors for some time now, but I couldn't find the reason. Memory fragmentation could be the cause…

It started when I upgraded to PyTorch 1.9.0 (and now 1.10.0), began using DDP (with a single GPU), trained with a minibatch size > 1 while evaluating (for validation) with a minibatch size of 1, and, as I just realized, the train sets of my datasets have a last minibatch of size 3, 4, or 8, smaller than the regular minibatch size of 32.

More likely causes of OOM:

  • OOM used to happen more frequently during validation (even on the first run, before training starts). Could validating with minibatch size 1 cause heavy memory fragmentation? Now I call torch.cuda.empty_cache() right before starting evaluation and right after all evaluation ends. In the few jobs I ran with this strategy, I no longer see OOM in the validation part. I also switched from the nccl backend to gloo; some recommend against nccl for a single GPU, and it is not really necessary since there is no communication or synchronization.

  • Now, OOM is happening during training. I don't want to call torch.cuda.empty_cache() frequently because I learned it is expensive. Could the smaller last minibatch in the train set cause memory fragmentation? The error occurs only after running many epochs, so fragmentation is a plausible cause, as it does not occur right away.

    scaler.scale(loss).backward()
  File "home/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "home/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 14.76 GiB total capacity; 12.64 GiB already allocated; 161.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
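The empty_cache strategy described in the first bullet above can be sketched roughly like this (model, val_loader, and the forward call are placeholders for the real training code):

```python
import torch

def run_validation(model, val_loader):
    # Release cached blocks before validation, so the batch-size-1 pass
    # doesn't fragment memory blocks that training still needs...
    torch.cuda.empty_cache()
    model.eval()
    with torch.no_grad():  # no graph is built, so activations are freed eagerly
        for batch in val_loader:
            _ = model(batch)
    # ...and release whatever the validation pass itself cached afterwards.
    torch.cuda.empty_cache()
```

empty_cache() only returns *cached but unused* blocks to the driver; it does not free live tensors, which is also why it cannot fix a genuine leak.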

The application here is a CNN.
I read somewhere that in NLP applications, the minibatch is padded to a large max sequence size (because sequences have different lengths) in order to:

  1. allocate the same block size for every minibatch;
  2. keep the minibatch size constant when using torch.backends.cudnn.benchmark=True (as I do), since cuDNN has tuned its algorithms for that size and a changing size forces algorithm re-selection every time.

So the last, smaller minibatch is probably what causes the memory fragmentation leading to OOM.
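If the short last minibatch is indeed the trigger, one simple workaround is to drop it so every batch has the same shape. A minimal sketch (the dataset here is a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 samples with batch size 32 leaves a short final batch of 4 samples.
dataset = TensorDataset(torch.randn(100, 3))

# drop_last=True discards that final short batch, so every batch the model
# sees has exactly the same shape (good for both the allocator and
# cudnn.benchmark's algorithm selection).
loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

print(len(loader))  # 3 full batches instead of 3 full + 1 short
```

The trade-off is that a few samples per epoch are never seen in that epoch; with shuffling, different samples are dropped each time.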

Less likely cause of OOM:

  • Does the PyTorch allocator in >= 1.9.0 have an issue? Does DDP make it worse?

Is this the right way to limit block splitting?

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

What is the “best” max_split_size_mb value?
The PyTorch docs don’t really explain much about this choice. They mention that it can come at a significant performance cost (I assume speed), so it is not free.
Can you explain a bit about that cost?

I learned here that defragmenting CUDA memory is not a thing, yet.

I should first fix the last minibatch before trying export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb, to avoid the potential cost.

thanks


Filling the last minibatch to match the regular minibatch size did not help; I still get OOM.

It looks like a GPU issue.
I ran the same code on 4 different clusters with different types of GPUs:

  1. Cluster 1: P100. 3 GB < GPU mem used < 6 GB. Memory usage completely stable. No issue.
  2. Cluster 2: V100. 3 GB < GPU mem used < 6 GB. Memory usage completely stable. No issue.
  3. Cluster 3: P100. 3 GB < GPU mem used < 6 GB. Memory usage completely stable. No issue.
  4. Cluster 4: T4. 6 GB < GPU mem used < 16 GB. Memory usage insanely unstable. Reached OOM.

In the 4th case, memory usage starts at 6 GB. Then, after ~200 epochs of being stable, it starts to oscillate between 6 and 8 GB,
later between 7 and 15 GB, crossing 11, 12, 13, 14… GB.
I use torch.cuda.empty_cache() before and after validation.
I checked: there is no memory leak.
If there were one, it would appear in the other cases as well. (There was one a couple of weeks ago, but I fixed it.)

Any idea why the T4s behave like this?
In all cases, I am using mixed-precision training.
Because T4s have plenty of tensor cores, I expected their memory usage to be the lowest.
I am also using DDP, but with a single GPU and the gloo backend.
The OOM happens during the training phase.

PyTorch 1.10.0 installation:
pip install torch==1.10.0 -f https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl

thanks

Excuse me, did you solve it?

No.
I am benchmarking several methods, and this error happens only with a single method, but I am not sure what caused it, or that it is really method-dependent. It could be PyTorch 1.10.0 or something else.

I made several changes to the code, and things seem OK now (on other servers), but I haven't tried the method that caused the OOM yet. Also, other servers showed the same issue back then, so it is not a T4 issue.

If you are using DDP, the first thing you could try is changing the backend to mpi (check with your server admins which backend is most stable there). In version 1.10.0 the backend was causing many issues; now, with mpi, things are fine for me, but this is server-dependent.

For validation, I was accidentally using the DDP-wrapped model; now I use the underlying model.

Search for memory leaks.
You may be tracking something that is not detached from the graph.
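A common pattern behind that kind of leak is accumulating the loss tensor itself across iterations, which keeps every iteration's autograd graph alive; accumulating a detached Python float instead lets each graph be freed promptly. A minimal illustration:

```python
import torch

w = torch.randn(10, requires_grad=True)

# Leaky: `running` stays attached to the autograd graph of every iteration,
# so none of those graphs (and their saved activations) can be freed.
running = 0
for _ in range(3):
    loss = (w * 2).sum()
    running = running + loss   # running is now a graph-attached tensor

# Fixed: .item() (or .detach()) breaks the graph reference each iteration.
total = 0.0
for _ in range(3):
    loss = (w * 2).sum()
    total += loss.item()       # plain float; the graph is freed immediately
```

On a real model the saved activations dominate, so the leaky version can grow GPU memory by a full forward pass' worth per iteration.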

I'll try the current version of the code on the T4 and report back.

I think torch 1.10.0 has an issue that causes OOM.
If you still can't fix it, I recommend downgrading to torch 1.9.0.

thanks
