Unable to allocate CUDA memory when there is enough cached memory

If fragmentation of the blocks is in an unfortunate pattern, you’ll see that 1.34GiB is free, but there isn’t a large enough free block to allocate 324.56 GiB.


Thank you, @smth.

You must have meant to say “allocate 350MB” :wink:

Having 1.7GB available (free+cached) and not being able to use even 20% of it. Ouch!

Is there some function I can call to defrag it? I don’t care if it takes time to do so.

What causes such fragmentation and how can it be avoided? Perhaps this is documented already?

Thank you.

Defrag is unfortunately not possible, because of the contract that pointers to Tensor data are immovable.

Usually, fragmentation occurs when small Tensors occupy memory and then get deallocated, while the larger Tensors allocated around them stay alive. I’ve seen it happen sometimes with variable-sequence-length RNNs, with a bit of bad luck added in.
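To make that pattern concrete, here is a toy, pure-Python sketch of an allocator whose live blocks can never move. It is not PyTorch’s caching allocator: the ToyAllocator class, its first-fit policy and the unit sizes are all made up for illustration. It only shows how freeing small blocks between long-lived large ones can leave plenty of free space in total but no single gap big enough for a new request.

```python
class ToyAllocator:
    """First-fit allocator over a fixed address range; live blocks never move."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}  # start offset -> size of each live block

    def free_gaps(self):
        """Return the sizes of the free gaps between live blocks, in address order."""
        gaps, cursor = [], 0
        for start in sorted(self.blocks):
            if start > cursor:
                gaps.append(start - cursor)
            cursor = start + self.blocks[start]
        if cursor < self.capacity:
            gaps.append(self.capacity - cursor)
        return gaps

    def alloc(self, size):
        """Place the block in the first gap that fits, or raise MemoryError."""
        cursor = 0
        for start in sorted(self.blocks):
            if start - cursor >= size:
                break  # the gap in front of this block is big enough
            cursor = start + self.blocks[start]
        else:
            # No gap between blocks fits; check the tail of the range.
            if self.capacity - cursor < size:
                raise MemoryError(f"no contiguous gap of {size} units")
        self.blocks[cursor] = size
        return cursor

    def free(self, start):
        del self.blocks[start]


heap = ToyAllocator(capacity=100)
small = []
for _ in range(4):
    small.append(heap.alloc(10))  # short-lived small block
    heap.alloc(15)                # long-lived large block
for start in small:
    heap.free(start)              # only the small blocks are released

gaps = heap.free_gaps()
total_free, largest_gap = sum(gaps), max(gaps)
try:
    heap.alloc(20)
    fits = True
except MemoryError:
    fits = False
print(f"free in total: {total_free}, largest gap: {largest_gap}, 20 units fit: {fits}")
```

In this toy run 40 units are free in total, yet a request for 20 units fails because the free space is split into four separate gaps.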


So basically there is no solution for that.

In my case it was a very average case of images trained with resnet and then unable to run the predictions despite all that available memory. So I guess the only way to move forward (other than trying to use less memory during training) is to save the model, reset everything else that holds data on cuda and then run the predictions.

And a related question, if you don’t mind answering: are there any tools to show which Python objects consume GPU RAM (besides the PyTorch preloaded structures, which take some 0.5GB per process)? I.e., is there some way to query PyTorch for references to variables that are on CUDA, and perhaps make some deductions from there?

Thank you.

That’s weird. Since it’s only for predictions, are they run inside a with torch.no_grad(): block, so that no temporary buffers are held?
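For reference, a minimal sketch of what running predictions under torch.no_grad() looks like. The small nn.Sequential model and the random batch are hypothetical stand-ins, not the resnet setup discussed in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()  # eval() puts dropout / batch-norm layers in inference mode

batch = torch.randn(4, 8, device=device)
with torch.no_grad():  # no autograd graph is recorded, so activations are not kept for backward
    predictions = model(batch)

print(predictions.shape, predictions.requires_grad)
```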

As for tools to inspect GPU memory: at the Python level, yes, using the garbage collector’s inspector.

See “How to debug causes of GPU memory leaks?” for one code snippet to do this.
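A rough sketch of that garbage-collector approach follows; it is not necessarily the exact snippet from the linked thread. The helper name live_tensors and its device_type parameter are made up here. It can only see tensors that are still referenced from Python, so memory held purely by internal buffers would presumably not show up.

```python
import gc

import torch


def live_tensors(device_type=None):
    """Return (type name, shape, size in bytes) for tensors tracked by the GC.

    device_type: e.g. "cuda" or "cpu"; None lists tensors on every device.
    """
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and (
                device_type is None or obj.device.type == device_type
            ):
                size_bytes = obj.element_size() * obj.nelement()
                found.append((type(obj).__name__, tuple(obj.shape), size_bytes))
        except Exception:
            # Some tracked objects raise on inspection; skip them.
            continue
    return found


example = torch.zeros(3, 5)  # a CPU tensor so the sketch also runs without a GPU
for name, shape, size_bytes in live_tensors("cpu"):
    print(name, shape, size_bytes)
```

Passing "cuda" instead of "cpu" restricts the listing to tensors that live on a GPU.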


no_grad, yes, but there was some extra setup code consuming GPU memory. I will try to break it down into functional pieces to understand it better, but the bottom line is the same.

Perfect. Thank you for that link! I am going to experiment with that code next.

Thank you for your help, @smth.

Apologies for resurrecting this - I am having the same issue regularly. I get the RuntimeError, as in the first message of this thread, the first time I send any data to the GPU.

I have exclusive access to the GPU, so I could solve my issue if I could force the GPU memory to be cleared or freed. Is there a function in torch which I can use to do this? I’ve reviewed the information about memory management on the docs here and I’m not entirely sure that torch.cuda.empty_cache() will resolve this.

An ideal solution for me would look something like:

torch.cuda.clear_memory_allocated()  # entirely clear all allocated memory
model = model.to(device)

Any advice well received.

My feeling is that your issue is different from the one discussed here, @JamesOwers. You, obviously, need to free the variables that hold the GPU RAM (or switch them to cpu); you can’t tell PyTorch to release them all for you, since that would leave your interpreter in an inconsistent state.

  • Go over your code and free any variables you no longer need as soon as they are no longer used.

  • If you’re using a jupyter nb you could create a “virtual” scope using ipyexperiments, which can then automate the release.

  • If outside jupyter, wrap your code in a function. Unless you create circular references, once the function returns it’ll release the local variables and free up the memory for you.
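The function-scope advice above can be sketched as follows. The function name run_predictions and the weakref probe are illustrative additions; the probe is only there so the caller can check that the local tensor is really gone after the function returns.

```python
import gc
import weakref

import torch


def run_predictions(device):
    """All tensors created here are locals, so they are released on return."""
    activations = torch.randn(256, 256, device=device)
    probe = weakref.ref(activations)   # does not keep the tensor alive
    result = float(activations.sum())  # return plain Python values, not tensors
    return result, probe


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
result, probe = run_predictions(device)

gc.collect()  # only matters if circular references were created
if device.type == "cuda":
    torch.cuda.empty_cache()  # return cached, now unused blocks to the driver

print("local tensor still alive:", probe() is not None)
```

Returning a plain float instead of a tensor is a deliberate choice here: a returned tensor would keep its GPU memory alive in the caller.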

Another important issue under jupyter is exceptions, please see: A guide to recovering from CUDA Out of Memory and other exceptions.

p.s. perhaps one could write something to automatically switch all cuda variables to cpu, diverting the “leak” to general RAM, which may help in a short term, but it’s not really solving the actual issue with your code, just delaying the inevitable.

Hi @stas ,

Thanks for your reply. To be clear, I get this error the first time I send any data to the GPU.

That is, when I call model.to(device), this is the first variable to be sent to the GPU - unless I’m misunderstanding, at this point I don’t have any variables to clear. Despite this, I get the error. I am therefore presuming there is uncleared memory from a previous process.

To address the others: I’m not in a notebook, and this is within a function. Additionally, I do not get any error about 95 times out of 100 when running this code.



Well, what GPU memory consumption is reported before you run this function (by nvidia-smi, or whatever other reporting tool you use)?

If it’s the first call, then you should have 100% GPU available before you do that call. I assume you’re using your own GPU card.

If you use some kind of online service, then it’s a different story.

If you start with GPU RAM already used up you should kill the previous processes if they didn’t quit.

Alternatively, it’s possible that you have 100% GPU RAM available but your very first variable is already bigger than the available GPU RAM.

It’s just very hard to diagnose your issue w/o you telling the full story - setup, size of GPU, local/online, etc.

In any case add some code to measure available RAM at the beginning of your code and an assert for it to bail if it can’t detect a sufficient amount of GPU RAM available, telling you to clean up any run-away processes if any.
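A minimal sketch of such a start-up check, assuming a reasonably recent PyTorch where torch.cuda.mem_get_info() is available and returns (free bytes, total bytes). The function name require_free_gpu_memory, its threshold and the injectable query parameter are made up for illustration; query exists only so the check can be exercised without a GPU.

```python
import torch


def require_free_gpu_memory(min_free_mib, device=0, query=torch.cuda.mem_get_info):
    """Raise if the device reports less than min_free_mib MiB of free memory.

    query must return (free_bytes, total_bytes) for the given device.
    """
    free_bytes, total_bytes = query(device)
    free_mib = free_bytes / 2**20
    if free_mib < min_free_mib:
        raise RuntimeError(
            f"GPU {device}: only {free_mib:.0f} MiB free of "
            f"{total_bytes / 2**20:.0f} MiB; clean up any run-away processes first"
        )
    return free_mib
```

At the top of a job this could be called as require_free_gpu_memory(1024) after checking torch.cuda.is_available(); the 1024 MiB threshold is an arbitrary example value.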

@stas - again, I much appreciate your input here and the time you’re taking to help me diagnose this.

I’ll describe the setup:

  • GPU cluster with a broad mix of different gpu types (Tesla K40m, GeForce Titan X, GeForce GTX Titan X, GeForce Titan X (Pascal))
  • Slurm job scheduler to coordinate job submission:
    • There are many users and my job will begin after another job has just finished
    • When my job begins, I have exclusive access to that GPU - the GPUs are only ever used by one user’s job at a time
  • It’s a service locally hosted by my university, so I can submit support tickets etc. I have reported the issue and we are struggling to fix. I’m here because I’m trying to find a simple workaround!

At the beginning of the job I report the usage with the tool GPUtil - but this uses nvidia-smi under the hood. The usage reported is always 0 - as expected, e.g.:

| ID | GPU | MEM |
|  0 |  0% |  0% |

I know that my variable is smaller than the available RAM because I’ve measured the size of my model (it’s a few megabytes), and because the error message is slightly different from yours; mine follows the format “Tried to allocate {small_number} ... {much_larger_number} free; ...”. For example:

RuntimeError: CUDA out of memory. 
Tried to allocate 4.50 MiB (GPU 0; 11.91 GiB total capacity;
213.75 MiB already allocated; 11.18 GiB free; 509.50 KiB cached)

This is what has led me to the conclusion that the GPU has not been properly cleared after a previously running job has finished.

Your proposed solution to bail if there isn’t enough RAM at the start will not work - there is enough RAM according to nvidia-smi and indeed the error message. I imagine there is not enough contiguous memory!
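To compare the numbers in such a message programmatically, here is a small sketch that parses the sizes out of the text. The parse_oom helper and its regular expression are made up for this thread and only cover the message layouts quoted here; other PyTorch versions may word the message differently.

```python
import re

_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"
_PATTERN = re.compile(
    rf"Tried to allocate {_SIZE}.*?"
    rf"{_SIZE} total capacity;\s*"
    rf"{_SIZE} already allocated;\s*"
    rf"{_SIZE} free;",
    re.DOTALL,
)


def parse_oom(message):
    """Return requested/capacity/allocated/free in bytes, or None if no match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    groups = match.groups()
    names = ("requested", "capacity", "allocated", "free")
    return {
        name: float(groups[2 * i]) * _UNITS[groups[2 * i + 1]]
        for i, name in enumerate(names)
    }


message = (
    "RuntimeError: CUDA out of memory.\n"
    "Tried to allocate 4.50 MiB (GPU 0; 11.91 GiB total capacity;\n"
    "213.75 MiB already allocated; 11.18 GiB free; 509.50 KiB cached)"
)
info = parse_oom(message)
print(f"free / requested = {info['free'] / info['requested']:.0f}x")
```

A ratio far above 1, as in this message, means the reported free memory vastly exceeds the failed request.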

Regardless, to fix, I think all I need to do is to clear the GPU’s memory at the beginning of my job (or simply wait until this is done). Is there a way to force this?

Alternatively, it could be that the GPU is clear, but the first variable is sent to the GPU memory in an extremely fragmented way. Is there any reason why this would happen?

Thank you for the additional information, @JamesOwers.

So your error message is very telling:

It says that you have 11GB (!) free and it can’t allocate 5MB - that makes no sense.

See this discussion, where I tried to diagnose non-contiguous memory only to discover that nvidia will re-allocate fragmented pages of at least 2MB to make contiguous memory. So unless your code somehow allocates memory in a way that consumes only a tiny fraction of each 2MB page, thereby fragmenting 12GB of RAM, this shouldn’t really happen.

So a few things I’d like to suggest in no particular order:

  1. catch that failure and add sleep so that the program doesn’t exit at that point of failure and check what nvidia-smi says about that card’s RAM status - what is the reported used/free memory there. This is to double check that perhaps there is something wrong with the card and that it reports wrong numbers.

  2. Since you said it happens 5% of the time, did you observe that it perhaps happens with the same specific card? i.e. again a faulty card?

  3. can you reliably reproduce when you hit that 5% situation?

  4. reduce your variable size by say half - does it fit into the memory? if not half again and so on - see what fits

  5. when that error happens, can you catch it and then try to allocate a simple large tensor, say torch.zeros() of a few GBs? For example, torch.ones((n*2**18)).cuda().contiguous(), where n is the desired number of MBs (a default float32 element takes 4 bytes, so 2**18 elements make 1MB). Replace cuda() with to(...) if needed to match your setup.

My feeling is that your array of cards has a faulty card. That last suggestion could be the key - allocate 10GB of RAM (say 80% of the card’s capacity) and free it right away at the beginning of your program - if it fails, you don’t want to use that card.
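A sketch of that allocate-and-free probe, under the assumption that a failed CUDA allocation surfaces as a RuntimeError (or a subclass of it). The helper name can_allocate is made up here.

```python
import torch


def can_allocate(n_mib, device):
    """Try to allocate, then immediately release, roughly n_mib MiB on device."""
    try:
        # 2**18 float32 elements * 4 bytes each = 1 MiB
        probe = torch.ones(n_mib * 2**18, dtype=torch.float32, device=device)
    except RuntimeError:
        # CUDA out-of-memory failures are reported as RuntimeError.
        return False
    del probe
    if torch.device(device).type == "cuda":
        torch.cuda.empty_cache()  # give the probed block back to the driver
    return True


print("1 MiB probe on CPU succeeded:", can_allocate(1, "cpu"))
```

At the start of a job, something like can_allocate(10 * 1024, "cuda:0") would test about 10GB; if it returns False, the job can report the card and bail out or be resubmitted.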

@stas - many thanks for this. I’m going to implement your suggestion of attempting to allocate some known large tensor right at the start of the job, and report & rerun upon failure.

Very much appreciate your help. Thank you.

Hello Guys,

If your batch size is large, try reducing it. I was using batch_size = 1024, and when I reduced it to 128 it worked like a charm!!
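If you would rather not find the working batch size by hand, the search can be automated. This is only a sketch: run_with_backoff and fake_step are hypothetical names, and the fake step simulates a GPU that can only fit batches up to 128, so the example runs without any GPU. In real code, step would run one training or prediction pass.

```python
def run_with_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or if we cannot shrink further.
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size):
    """Stand-in for a real pass; pretends only batches up to 128 fit."""
    if batch_size > 128:
        raise RuntimeError("CUDA out of memory. (simulated)")
    return f"processed {batch_size} samples"


used_batch_size, outcome = run_with_backoff(fake_step, batch_size=1024)
print(used_batch_size, outcome)
```

With a real GPU it would probably also make sense to call torch.cuda.empty_cache() between attempts.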

hope this is useful.


Running into the same problem. Do we have a general documentation or a blog that explains the RCA and the solution?


RuntimeError                              Traceback (most recent call last)
      1 #training starts
      2 ep = 30
----> 3 train_net(ep)

 in train_net(n_epochs)
     20             # forward pass
---> 21             output = net(images)
     22             #print("output.type", output.type())
     23             #output = output.type(torch.cuda.FloatTensor)

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

 in forward(self, x)
     24     def forward(self, x):
---> 25         x = F.relu(self.batch1(self.conv1(x)))
     26         x = F.relu(self.batch1(self.conv1a(x)))
     27         x = self.pool1(x)

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in forward(self, input)
    344     def forward(self, input):
--> 345         return self.conv2d_forward(input, self.weight)
    347 class Conv3d(_ConvNd):

~\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in conv2d_forward(self, input, weight)
    340                             _pair(0), self.dilation, self.groups)
    341         return F.conv2d(input, weight, self.bias, self.stride,
--> 342                         self.padding, self.dilation, self.groups)
    344     def forward(self, input):

RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 16.20 MiB free; 2.64 GiB reserved in total by PyTorch)

As the error message states, your GPU is running out of memory, so you would need to reduce either the batch size or the model itself, or you could potentially trade compute for memory using torch.utils.checkpoint.
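A minimal sketch of the torch.utils.checkpoint option. The CheckpointedNet model is a made-up toy, not the network from the traceback above, and the use_reentrant=False argument assumes a reasonably recent PyTorch release. Activations inside each checkpointed block are not stored during the forward pass and are recomputed during backward instead.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedNet(nn.Module):
    """Toy model whose two blocks are recomputed in backward to save memory."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        # Intermediate activations of each block are dropped after the forward
        # pass and recomputed when gradients are needed.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)


net = CheckpointedNet()
inputs = torch.randn(8, 32)
loss = net(inputs).sum()
loss.backward()  # triggers the recomputation of both blocks
print(net.block1[0].weight.grad.shape)
```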

Hi everyone, I have a GTX 1060 6GB, and I got this error message:

RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 6.00 GiB total capacity; 2.09 GiB already allocated; 2.47 GiB free; 13.55 MiB cached)

But that doesn’t make any sense. Any help?

Well, you may want to read this thread from the top - as it discusses this problem - and then it’d make sense, thanks to the helpful replies of others.

I’m having a similar problem with memory:

Tried to allocate 2.00 MiB (GPU 0; 11.00 GiB total capacity; 9.44 GiB already allocated; 997.01 MiB free; 10.01 GiB reserved in total by PyTorch)

I don’t think I have the fragmentation issue discussed above, but 2 MB shouldn’t be a problem (I’m using a really small batch size).
I’ve also tried running on 2 GPUs that are bridged with an SLI bridge. This gives me a total of 22 GB, but I’m getting the same error message with 11.00 GiB. Does Pytorch support GPUs that are bridged?
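The message above still reports 11.00 GiB for GPU 0, which suggests that each card is seen as a separate device with its own memory rather than as one pooled 22 GB device. A quick sketch for checking what PyTorch itself sees; describe_gpus is a made-up helper name, and it returns an empty list when no CUDA device is visible.

```python
import torch


def describe_gpus():
    """Return (index, name, total memory in GiB) for each visible CUDA device."""
    return [
        (
            index,
            torch.cuda.get_device_name(index),
            torch.cuda.get_device_properties(index).total_memory / 2**30,
        )
        for index in range(torch.cuda.device_count())
    ]


for index, name, total_gib in describe_gpus():
    print(f"cuda:{index}  {name}  {total_gib:.2f} GiB")
```

If two entries of about 11 GiB each are listed, using the second card would presumably mean placing tensors or parts of the model on it explicitly.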

How did you solve this problem, @stas?
I’m getting this error. Help! @ptrblck

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 2.00 GiB total capacity; 1.09 GiB already allocated; 45.82 MiB free; 1.11 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFEE82575A200007FFEE8257540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFEE81F9C0600007FFEE81F9B90 c10_cuda.dll!c10::CUDAOutOfMemoryError::CUDAOutOfMemoryError [<unknown file> @ <unknown line number>]
00007FFEE820069600007FFEE81FF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFEE820083A00007FFEE81FF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFEE81F509900007FFEE81F4EB0 c10_cuda.dll!c10::cuda::CUDAStream::unpack [<unknown file> @ <unknown line number>]
00007FFE86D91FF100007FFE86D91EB0 torch_cuda.dll!at::native::empty_cuda [<unknown file> @ <unknown line number>]
00007FFE86EA8AFE00007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE86EA42A500007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE7EEA1A3A00007FFE7EE8D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFE7EEA000500007FFE7EE8D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFE7EF718A000007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7EF828DC00007FFE7EF82850 torch_cpu.dll!at::empty [<unknown file> @ <unknown line number>]
00007FFE8634F5E400007FFE8634F560 torch_cuda.dll!at::native::mm_cuda [<unknown file> @ <unknown line number>]
00007FFE86EB1B0F00007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE86EA1B2200007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE7EF6D94900007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7EFA057700007FFE7EFA0520 torch_cpu.dll!at::mm [<unknown file> @ <unknown line number>]
00007FFE802FEC7900007FFE8020E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFE7EAB715700007FFE7EAB6290 torch_cpu.dll!at::indexing::TensorIndex::boolean [<unknown file> @ <unknown line number>]
00007FFE7EF6D94900007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7F08210700007FFE7F0820B0 torch_cpu.dll!at::Tensor::mm [<unknown file> @ <unknown line number>]
00007FFE8019B96900007FFE8019A760 torch_cpu.dll!torch::autograd::profiler::Event::kind [<unknown file> @ <unknown line number>]
00007FFE801517EC00007FFE80151580 torch_cpu.dll!torch::autograd::generated::AddmmBackward::apply [<unknown file> @ <unknown line number>]
00007FFE80147E9100007FFE80147B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFE806AF9BA00007FFE806AF300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFE806B03AD00007FFE806AFFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFE806B4FE200007FFE806B4CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFE806B4C4100007FFE806B4BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FFEC38608F700007FFEC3839F80 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFE806ABF1400007FFE806AB780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFF160A0E8200007FFF160A0D40 ucrtbase.dll!beginthreadex [<unknown file> @ <unknown line number>]
00007FFF188A7BD400007FFF188A7BC0 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFF190ECE5100007FFF190ECE30 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]