Unable to allocate CUDA memory when there is enough cached memory

Can someone please explain this:

RuntimeError: CUDA out of memory. 
Tried to allocate 350.00 MiB 
(GPU 0; 7.93 GiB total capacity; 5.73 GiB already allocated; 
324.56 MiB free; 1.34 GiB cached)

If there is 1.34 GiB cached, how can it not allocate 350.00 MiB?

There is only one process running. torch-1.0.0/cuda10

And a related question:

Are there any tools to show which python objects consume GPU RAM (besides the pytorch preloaded structures which take some 0.5GB per process) ? i.e. is there some way to query pytorch for a reference to variables that are on CUDA and perhaps from there make some deductions?

Thank you.


If fragmentation of the blocks is in an unfortunate pattern, you’ll see that 1.34GiB is free, but there isn’t a large enough free block to allocate 324.56 GiB.


Thank you, @smth.

You must have meant to say “allocate 350MB” :wink:

Having 1.7GB available (free+cached) and not being able to use even 20% of it. Ouch!

Is there some function I can call to defrag it? I don’t care if it takes time to do so.

What causes such fragmentation and how can it be avoided? Perhaps this is documented already?

Thank you.


Defrag is unfortunately not possible, because of the contract that pointers to Tensor data are immovable.

Usually, fragmentation occurs when small Tensors occupy memory and then get deallocated, while the larger Tensors allocated around them stay alive. I’ve seen it happen sometimes with variable-sequence-length RNNs, with a bit of bad luck added in.
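To make the fragmentation argument concrete, here is a toy model in plain Python. It is only a sketch: memory is a list of fixed-size slots, and the names (memory, largest_free_run) and the 1-slot/3-slot sizes are invented for illustration. It does not mirror PyTorch's real caching allocator.

```python
def largest_free_run(slots):
    """Return the length of the longest run of consecutive free (None) slots."""
    best = run = 0
    for slot in slots:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best


# Interleave short-lived small allocations (1 slot) with long-lived large ones (3 slots).
memory = []
for i in range(4):
    memory.append(f"small{i}")
    memory.extend([f"large{i}"] * 3)

# The small allocations are released; the large ones stay alive and cannot be moved.
memory = [None if slot.startswith("small") else slot for slot in memory]

total_free = memory.count(None)
largest_block = largest_free_run(memory)
request = 2  # slots needed by the next allocation

print(f"free slots: {total_free}, largest contiguous block: {largest_block}")
print("request fits:", largest_block >= request)
```

Four slots are free in total, yet a two-slot request fails because no two free slots are adjacent. That is the situation the error message describes, at a much larger scale.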


So basically there is no solution for that.

In my case it was a very average case of images trained with resnet and then unable to run the predictions despite all that available memory. So I guess the only way to move forward (other than trying to use less memory during training) is to save the model, reset everything else that holds data on cuda and then run the predictions.

And the related question if you don’t mind answering: Are there any tools to show which python objects consume GPU RAM (besides the pytorch preloaded structures which take some 0.5GB per process) ? i.e. is there some way to query pytorch for a reference to variables that are on CUDA and perhaps from there make some deductions?

Thank you.


That’s weird. Since it’s only for predictions, are they run inside a with torch.no_grad(): block, so that no temporary buffers are held for the backward pass?
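For reference, a prediction loop under torch.no_grad() might look roughly like this. It is a sketch, not the poster's code: predict, model and batches are hypothetical names, and the default device string is an assumption.

```python
def predict(model, batches, device="cuda"):
    """Run inference without recording the autograd graph."""
    import torch  # imported lazily so the sketch loads even where PyTorch is absent

    model.eval()  # switch dropout/batch-norm layers to inference behaviour
    outputs = []
    with torch.no_grad():  # no intermediate activations are kept for backward
        for batch in batches:
            result = model(batch.to(device))
            outputs.append(result.cpu())  # keep results on the host, not the GPU
    return outputs
```

Moving each result to the CPU before storing it keeps the list of outputs from accumulating on the GPU.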

As for tools that show which Python objects consume GPU RAM: at the Python level, yes, using the garbage collector’s inspector.

See:

How to debug causes of GPU memory leaks? for one code snippet to do this.
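A sketch along those lines, assuming the goal is simply to list tensors that currently live on a CUDA device. The helper names live_cuda_tensors and format_size are made up for this example. Note that views share storage with their base tensor, so summing the reported sizes can overcount.

```python
import gc


def format_size(num_bytes):
    """Render a byte count the way the OOM messages do (KiB/MiB/GiB)."""
    for unit in ("B", "KiB", "MiB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} GiB"


def live_cuda_tensors():
    """Return (type name, shape, bytes) for every CUDA tensor the GC can see."""
    import torch  # imported lazily so the helper above works without PyTorch

    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                size = obj.element_size() * obj.nelement()
                found.append((type(obj).__name__, tuple(obj.size()), size))
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            pass
    return found


def print_cuda_tensors():
    for name, shape, size in live_cuda_tensors():
        print(name, shape, format_size(size))
```

Calling print_cuda_tensors() at a few points in the program and comparing the output is usually enough to spot what is still holding memory.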


no_grad, yes, but there was some extra setup code consuming GPU memory. I will try to break it down into functional pieces to understand it better, but the bottom line is the same.

Perfect. Thank you for that link! I am going to experiment with that code next.

Thank you for your help, @smth.


Apologies for resurrecting this - I am having the same issue regularly. I get the RuntimeError, as in the first message of this thread, the first time I send any data to the GPU.

I have exclusive access to the GPU, so I could solve my issue if I could force the GPU memory to be cleared or freed. Is there a function in torch which I can use to do this? I’ve reviewed the information about memory management on the docs here and I’m not entirely sure that torch.cuda.empty_cache() will resolve this.

An ideal solution for me would look something like:

...
torch.cuda.clear_memory_allocated()  # entirely clear all allocated memory
model = model.to(device)
...

Any advice well received.


My feeling is that your issue is different from the one discussed here, @JamesOwers. You, obviously, need to free the variables that hold the GPU RAM (or switch them to cpu), you can’t tell pytorch to release them all for you since it’d lead to an inconsistent state of your interpreter.

  • Go over your code and free any variables you no longer need as soon as they are no longer used.

  • If you’re using a jupyter nb you could create a “virtual” scope using ipyexperiments, which can then automate the release.

  • If outside jupyter, wrap your code in a function. Unless you create circular references, once the function returns it’ll release the local variables and free up the memory for you.
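The function-scope point can be shown without a GPU. In this sketch Buffer is a made-up stand-in for anything that owns GPU memory (a CUDA tensor, for instance), and a weak reference is used only to observe when the object is released.

```python
import gc
import weakref


class Buffer:
    """Stand-in for an object that owns GPU memory, such as a CUDA tensor."""

    def __init__(self, size):
        self.data = bytearray(size)


def run_job(watch):
    buf = Buffer(1024)              # lives only inside this function
    watch.append(weakref.ref(buf))  # weak reference: does not keep buf alive
    return len(buf.data)            # return a small result, not the buffer itself


watch = []
result = run_job(watch)
gc.collect()  # only matters if the function created reference cycles

print("buffer released:", watch[0]() is None)
# With real CUDA tensors, torch.cuda.empty_cache() can be called at this point
# to hand the now-unused cached blocks back to the driver.
```

The same lifetime rules apply to tensors: once nothing refers to them, PyTorch can reuse their memory.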

Another important issue under jupyter is exceptions, please see: A guide to recovering from CUDA Out of Memory and other exceptions.

p.s. perhaps one could write something to automatically switch all cuda variables to cpu, diverting the “leak” to general RAM, which may help in a short term, but it’s not really solving the actual issue with your code, just delaying the inevitable.


Hi @stas ,

Thanks for your reply. To be clear, I get this error the first time I send any data to the GPU.

That is, when I call model.to(device), this is the first variable to be sent to the GPU - unless I’m misunderstanding, at this point I don’t have any variables to clear. Despite this, I get the error. I am therefore presuming there is uncleared memory from a previous process.

To address the others: I’m not in a notebook, and this is within a function. Additionally, I do not get any error about 95 times out of 100 when running this code.

Cheers,

James

Well, what GPU memory consumption is reported before you run this function (by nvidia-smi, or whatever other reporting tool you use)?

If it’s the first call, then you should have 100% of the GPU memory available before you make that call. I assume you are using your own GPU card.

If you use some kind of online service, then it’s a different story.

If you start with GPU RAM already used up you should kill the previous processes if they didn’t quit.

Alternatively, it’s possible that you have 100% GPU RAM available but your very first variable is already bigger than the available GPU RAM.

It’s just very hard to diagnose your issue w/o you telling the full story - setup, size of GPU, local/online, etc.

In any case add some code to measure available RAM at the beginning of your code and an assert for it to bail if it can’t detect a sufficient amount of GPU RAM available, telling you to clean up any run-away processes if any.
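One way to do that check is to ask nvidia-smi directly. This is a minimal sketch: the function names and the 5% threshold are assumptions chosen for the example, and it assumes the GPU index your job sees matches nvidia-smi's ordering, which may not hold under a scheduler.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB, one per GPU) into (used, total) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of every visible GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_memory_csv(output)


def assert_gpu_is_free(gpu_index=0, max_used_fraction=0.05):
    """Bail out early if the GPU already has memory in use (threshold is arbitrary)."""
    used, total = query_gpu_memory()[gpu_index]
    assert used <= max_used_fraction * total, (
        f"GPU {gpu_index} already has {used} of {total} MiB in use; "
        "check for run-away processes before starting"
    )
```

Calling assert_gpu_is_free() as the first line of the job makes a dirty GPU fail fast with a readable message instead of a late out-of-memory error.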

@stas - again, I much appreciate your input here and your time helping me diagnose this.

I’ll describe the setup:

  • GPU cluster with a broad mix of different gpu types (Tesla K40m, GeForce Titan X, GeForce GTX Titan X, GeForce Titan X (Pascal))
  • Slurm job scheduler to coordinate job submission:
    • There are many users and my job will begin after another job has just finished
    • When my job begins, I have exclusive access to that GPU - the GPUs are only ever used by one user’s job at a time
  • It’s a service locally hosted by my university, so I can submit support tickets etc. I have reported the issue and we are struggling to fix it. I’m here because I’m trying to find a simple workaround!

At the beginning of the job I report the usage with the tool GPUtil - but this uses nvidia-smi under the hood. The usage reported is always 0 - as expected, e.g.:

| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |

I know that my variable is smaller than the available RAM because I’ve measured the size of my model (it’s a few megabytes), and because the error message is slightly different from yours; mine follows the format: tried to allocate {small_number} (... {much_larger_number} free; ...). For example:

RuntimeError: CUDA out of memory. 
Tried to allocate 4.50 MiB (GPU 0; 11.91 GiB total capacity;
213.75 MiB already allocated; 11.18 GiB free; 509.50 KiB cached)

This is what has led me to the conclusion that the GPU has not been properly cleared after a previously running job has finished.

Your proposed solution to bail if there isn’t enough RAM at the start will not work - there is enough RAM according to nvidia-smi and indeed the error message. I imagine there is not enough contiguous memory!

Regardless, to fix, I think all I need to do is to clear the GPU’s memory at the beginning of my job (or simply wait until this is done). Is there a way to force this?

Alternatively, it could be that the GPU is clear, but the first variable is sent to the GPU memory in an extremely fragmented way. Is there any reason why this would happen?

Thank you for the additional information, @JamesOwers.

So your error message is very telling:

It says that you have 11GB (!) free and it can’t allocate 5MB - that makes no sense.
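The comparison can be made mechanical by pulling the numbers out of the message. This is a minimal sketch, assuming the message keeps the format shown in this thread. parse_oom_message and the unit table are names made up for the example, not a PyTorch API.

```python
import re

UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
NUMBER = r"([\d.]+)\s+(KiB|MiB|GiB)"
PATTERNS = {
    "requested": r"Tried to allocate\s+" + NUMBER,
    "total": NUMBER + r"\s+total capacity",
    "allocated": NUMBER + r"\s+already allocated",
    "free": NUMBER + r"\s+free",
    "cached": NUMBER + r"\s+(?:cached|reserved)",
}


def parse_oom_message(message):
    """Return the byte counts found in a 'CUDA out of memory' message."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            fields[name] = float(match.group(1)) * UNITS[match.group(2)]
    return fields


# The message quoted earlier in this thread, joined onto one line.
message = (
    "CUDA out of memory. Tried to allocate 4.50 MiB (GPU 0; 11.91 GiB total capacity; "
    "213.75 MiB already allocated; 11.18 GiB free; 509.50 KiB cached)"
)
info = parse_oom_message(message)

if info["free"] > info["requested"]:
    print("free memory exceeds the request, so plain exhaustion is not the cause")
```

When the reported free memory is far larger than the request, as here, the usual "reduce your batch size" advice does not apply and the hardware or driver state deserves a look.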

See this discussion, where I tried to diagnose non-contiguous memory only to discover that nvidia will re-allocate fragmented pages of at least 2MB to make contiguous memory. So unless your code somehow allocates memory in a way that consumes only a tiny fraction of each 2MB page, thereby fragmenting all 12GB of RAM, this shouldn’t really happen.

So a few things I’d like to suggest in no particular order:

  1. catch that failure and add sleep so that the program doesn’t exit at that point of failure and check what nvidia-smi says about that card’s RAM status - what is the reported used/free memory there. This is to double check that perhaps there is something wrong with the card and that it reports wrong numbers.

  2. Since you said it happens 5% of the time, did you observe that it perhaps happens with the same specific card? i.e. again a faulty card?

  3. can you reliably reproduce when you hit that 5% situation?

  4. reduce your variable size by say half - does it fit into the memory? if not half again and so on - see what fits

  5. when that error happens, can you catch it and then try to allocate a simple large tensor of a few GBs? For example, torch.ones((n*2**18)).cuda().contiguous(), where n is the number of desired MBs (a float32 element takes 4 bytes, so 2**18 elements make 1MB); replace cuda() with to(...) if your setup needs it.

My feeling is that your array of cards has a faulty card. That last suggestion could be the key - allocate 10GB of RAM (say 80% of the card’s capacity) and free it right away at the beginning of your program - if it fails, you don’t want to use that card.
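A sketch of that probe, to run at the very start of the job. The function names and the default device string are assumptions for the example. The size passed in would be something like 80% of the card's capacity, as suggested above.

```python
def float32_elements_for_mib(n_mib):
    """float32 takes 4 bytes, so 2**18 elements occupy exactly 1 MiB."""
    return n_mib * 2**18


def probe_gpu(n_mib, device="cuda:0"):
    """Try to allocate n_mib MiB in one block; return True if it succeeds."""
    import torch  # imported lazily so the size helper works without PyTorch

    try:
        probe = torch.ones(float32_elements_for_mib(n_mib), device=device)
    except RuntimeError as err:  # CUDA out-of-memory errors are RuntimeErrors
        print(f"probe of {n_mib} MiB failed on {device}: {err}")
        return False
    del probe                  # drop the only reference to the probe tensor
    torch.cuda.empty_cache()   # return the cached block to the driver
    return True
```

If probe_gpu(10 * 1024) returns False on a card that reports 11GB free, the job can exit or be resubmitted before any real work starts.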


@stas - many thanks for this. I’m going to implement your suggestion of attempting to allocate some known large tensor right at the start of the job, and report & rerun upon failure.

Very much appreciate your help. Thank you.


Hello Guys,

If your batch size is large, try reducing it. I was using batch_size = 1024, and when I reduced it to 128 it worked like a charm!!

hope this is useful.
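The search for a batch size that fits can be automated by halving on failure. This is a sketch only: largest_fitting_batch_size and fake_step are hypothetical names, and the fake step just imitates a model that runs out of memory above 128, mirroring the numbers in this post.

```python
def largest_fitting_batch_size(run_step, start, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising OOM."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: do not hide it
            # In real PyTorch code, free leftover references and call
            # torch.cuda.empty_cache() here before retrying.
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit")


def fake_step(batch_size):
    """Stand-in for one training step; pretends anything above 128 does not fit."""
    if batch_size > 128:
        raise RuntimeError("CUDA out of memory.")


chosen = largest_fitting_batch_size(fake_step, start=1024)
print("batch size that fits:", chosen)
```

In practice run_step would execute one forward and backward pass on a batch of the given size.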

regards,

Running into the same problem. Is there general documentation or a blog post that explains the root cause and the solution?

Stack trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
 in 
      1 #training starts
      2 ep = 30
----> 3 train_net(ep)

 in train_net(n_epochs)
     19 
     20             # forward pass
---> 21             output = net(images)
     22             #print("output.type", output.type())
     23             #output = output.type(torch.cuda.FloatTensor)

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

 in forward(self, x)
     23 
     24     def forward(self, x):
---> 25         x = F.relu(self.batch1(self.conv1(x)))
     26         x = F.relu(self.batch1(self.conv1a(x)))
     27         x = self.pool1(x)

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in forward(self, input)
    343 
    344     def forward(self, input):
--> 345         return self.conv2d_forward(input, self.weight)
    346 
    347 class Conv3d(_ConvNd):

~\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in conv2d_forward(self, input, weight)
    340                             _pair(0), self.dilation, self.groups)
    341         return F.conv2d(input, weight, self.bias, self.stride,
--> 342                         self.padding, self.dilation, self.groups)
    343 
    344     def forward(self, input):

RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 16.20 MiB free; 2.64 GiB reserved in total by PyTorch)

As the error message states your GPU is running out of memory, so you would need to either reduce the batch size, the model itself, or could potentially trade compute for memory using torch.utils.checkpoint.
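For the last option, a rough sketch of how torch.utils.checkpoint can be applied to part of a forward pass. The function and argument names are hypothetical, and block1 and block2 stand for any two nn.Module pieces of a network.

```python
def forward_with_checkpoint(block1, block2, x):
    """Run block1 normally and block2 under activation checkpointing."""
    from torch.utils.checkpoint import checkpoint  # lazy import, needs PyTorch

    x = block1(x)
    # Activations inside block2 are not stored during the forward pass; they are
    # recomputed during backward, trading extra compute for lower peak memory.
    # Newer PyTorch releases also accept a use_reentrant argument here.
    x = checkpoint(block2, x)
    return x
```

The first block is left unchecked so that the tensor entering the checkpointed block already requires gradients.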


Hi everyone, I have a GTX 1060 6GB, and I got this error message:

RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 6.00 GiB total capacity; 2.09 GiB already allocated; 2.47 GiB free; 13.55 MiB cached)

But that does not make any sense. Any help?

Well, you may want to read this thread from the top - as it discusses this problem - and then it’d make sense, thanks to the helpful replies of others.

I’m having a similar problem with memory:

Tried to allocate 2.00 MiB (GPU 0; 11.00 GiB total capacity; 9.44 GiB already allocated; 997.01 MiB free; 10.01 GiB reserved in total by PyTorch)

I don’t think I have the fragmentation issue discussed above, but 2 MB shouldn’t be a problem (I’m using a really small batch size).
I’ve also tried running on 2 GPUs that are bridged with an SLI bridge. This gives me a total of 22 GB, but I’m getting the same error message with 11.00 GiB. Does Pytorch support GPUs that are bridged?