Unable to allocate CUDA memory when there is enough cached memory

Hi,
How did you solve this problem, @stas?
I'm getting this error. Help, please! @ptrblck

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 2.00 GiB total capacity; 1.09 GiB already allocated; 45.82 MiB free; 1.11 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFEE82575A200007FFEE8257540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFEE81F9C0600007FFEE81F9B90 c10_cuda.dll!c10::CUDAOutOfMemoryError::CUDAOutOfMemoryError [<unknown file> @ <unknown line number>]
00007FFEE820069600007FFEE81FF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFEE820083A00007FFEE81FF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFEE81F509900007FFEE81F4EB0 c10_cuda.dll!c10::cuda::CUDAStream::unpack [<unknown file> @ <unknown line number>]
00007FFE86D91FF100007FFE86D91EB0 torch_cuda.dll!at::native::empty_cuda [<unknown file> @ <unknown line number>]
00007FFE86EA8AFE00007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE86EA42A500007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE7EEA1A3A00007FFE7EE8D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFE7EEA000500007FFE7EE8D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFE7EF718A000007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7EF828DC00007FFE7EF82850 torch_cpu.dll!at::empty [<unknown file> @ <unknown line number>]
00007FFE8634F5E400007FFE8634F560 torch_cuda.dll!at::native::mm_cuda [<unknown file> @ <unknown line number>]
00007FFE86EB1B0F00007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE86EA1B2200007FFE86E4E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFE7EF6D94900007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7EFA057700007FFE7EFA0520 torch_cpu.dll!at::mm [<unknown file> @ <unknown line number>]
00007FFE802FEC7900007FFE8020E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFE7EAB715700007FFE7EAB6290 torch_cpu.dll!at::indexing::TensorIndex::boolean [<unknown file> @ <unknown line number>]
00007FFE7EF6D94900007FFE7EF68FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFE7F08210700007FFE7F0820B0 torch_cpu.dll!at::Tensor::mm [<unknown file> @ <unknown line number>]
00007FFE8019B96900007FFE8019A760 torch_cpu.dll!torch::autograd::profiler::Event::kind [<unknown file> @ <unknown line number>]
00007FFE801517EC00007FFE80151580 torch_cpu.dll!torch::autograd::generated::AddmmBackward::apply [<unknown file> @ <unknown line number>]
00007FFE80147E9100007FFE80147B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFE806AF9BA00007FFE806AF300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFE806B03AD00007FFE806AFFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFE806B4FE200007FFE806B4CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFE806B4C4100007FFE806B4BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FFEC38608F700007FFEC3839F80 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFE806ABF1400007FFE806AB780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFF160A0E8200007FFF160A0D40 ucrtbase.dll!beginthreadex [<unknown file> @ <unknown line number>]
00007FFF188A7BD400007FFF188A7BC0 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFF190ECE5100007FFF190ECE30 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]

You are running out of memory, so you would need to reduce the batch size or the overall model architecture. Note that your GPU has 2GB, which limits the workloads that can run on this device.

You could also try torch.utils.checkpoint to trade compute for memory.
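For illustration, here is a minimal sketch of checkpointing with a toy model (not your architecture); the activations inside block are recomputed during the backward pass instead of being stored:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy example of trading compute for memory: activations inside `block`
# are not kept during the forward pass and are recomputed during backward.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU())
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        x = checkpoint(self.block, x)  # recomputed in backward
        return self.head(x)

model = Net().cuda()
x = torch.randn(8, 512, device="cuda", requires_grad=True)
model(x).sum().backward()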

Reducing to the smallest batch_size = 2 still didn't work. It gives this error:
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 2.00 GiB total capacity; 1.01 GiB already allocated; 105.76 MiB free; 1.05 GiB reserved in total by PyTorch)

I tried restarting and other things, but it didn't work.
When running without CUDA, the notebook freezes both locally and in Colab.

Oh, it might be a problem in my implementation; a pretrained network using CUDA works.

It could be that your GPU is just too small for the job you’re trying to do. Perhaps use Colab to train (free) and then your GPU for finetune/inference?


Yes, it might be. It's a great idea to train on Colab and fine-tune locally. :slight_smile:

I think I have a similar issue. The model is a BiLSTM+CRF. GPU memory usage spikes randomly and then RuntimeError: CUDA out of memory. A larger batch size worked fine; a smaller batch size worked fine once, and a couple of other times it ended in the runtime error.

All experiments have the same parameters except the following:
Light blue - batch size 128
All others - batch size 32

Have a look at this memory profiler/monitor if you're running in a Jupyter notebook - https://github.com/stas00/ipyexperiments - it might help you identify where you lose that memory.
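Outside a notebook, a minimal sketch along the same lines would be to log PyTorch's own memory counters around a training step (the model and batch below are just dummies to show the calls):

import torch
import torch.nn as nn

# Sketch: print what the caching allocator currently holds.
# memory_reserved() was called memory_cached() in older PyTorch releases.
def log_cuda_memory(tag):
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

model = nn.Linear(1024, 1024).cuda()       # placeholder model
x = torch.randn(64, 1024, device="cuda")   # placeholder batch

log_cuda_memory("before forward")
loss = model(x).sum()
log_cuda_memory("after forward")
loss.backward()
log_cuda_memory("after backward")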

Not running in a notebook. But this might give some clues. Thanks.

I am afraid I have the issue that @smth mentioned. I am working with variable-sequence-length RNNs.

This issue seems to be solved here or are you still seeing it?

Hi Ptrblck,

I face a CUDA memory issue after running for one epoch.

“RuntimeError: CUDA out of memory. Tried to allocate 1.64 GiB (GPU 0; 15.90 GiB total capacity; 13.96 GiB already allocated; 393.38 MiB free; 904.64 MiB cached)”

would you please help me with that?

Since you are running out of memory, you would need to lower the batch size or you could have a look at torch.utils.checkpoint to trade compute for memory.
Also, if not already done, wrap the validation loop in a with torch.no_grad() block, and avoid storing tensors that are not detached from the computation graph.
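For example (a generic sketch with a dummy model and data, not your training script):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Generic validation loop sketch: no graph is built inside torch.no_grad(),
# and .item() stores a plain Python float instead of a tensor attached to the graph.
model = nn.Linear(20, 2).cuda()
criterion = nn.CrossEntropyLoss()
val_loader = DataLoader(TensorDataset(torch.randn(64, 20), torch.randint(0, 2, (64,))),
                        batch_size=16)

model.eval()
val_loss = 0.0
with torch.no_grad():
    for data, target in val_loader:
        data, target = data.cuda(), target.cuda()
        output = model(data)
        val_loss += criterion(output, target).item()
print(val_loss / len(val_loader))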


@ptrblck
@mikey_t: Did you solve your problem?
I have the same issue.
RuntimeError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 15.78 GiB total capacity; 14.60 GiB already allocated; 15.44 MiB free; 14.70 GiB reserved in total by PyTorch)
Before starting the training, nvidia-smi says 0 MB is used and no processes are running. I am running it on one Tesla V100-SXM2 GPU.

My batch size is 1, which is approximately 150 images. I feed it to a pretrained ResNet18 PyTorch model whose output embedding is fed to a transformer encoder and finally to a CTC loss function. The model has 37,818,496 parameters in total.
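Roughly, the pipeline has this form (a simplified sketch only, not the actual code; all dimensions here are placeholders):

import torch
import torch.nn as nn
import torchvision.models as models

# Simplified sketch of the described pipeline: ResNet18 features -> transformer
# encoder -> per-step class scores for CTC. All sizes are assumptions.
class OCRModel(nn.Module):
    def __init__(self, vocab_size=100, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):                    # x: (N, 3, H, W)
        f = self.features(x)                 # (N, 512, H/32, W/32)
        seq = f.flatten(2).permute(2, 0, 1)  # (S, N, 512) sequence for the encoder
        return self.fc(self.encoder(seq)).log_softmax(-1)  # CTC expects log-probs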

The failure trace shows out of memory in the ResNet forward pass:
File "/home/####/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward self.padding, self.dilation, self.groups)

    def conv2d_forward(self, input, weight):
        if self.padding_mode == 'circular':
            expanded_padding = ((self.padding[1] + 1) // 2, self.padding[1] // 2,
                                (self.padding[0] + 1) // 2, self.padding[0] // 2)
            return F.conv2d(F.pad(input, expanded_padding, mode='circular'),
                            weight, self.bias, self.stride,
                            _pair(0), self.dilation, self.groups)
        return F.conv2d(input, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

In principle this should be easily handled by the GPU. This error pops up in the first epoch. What does "14.60 GiB already allocated" really mean?

Happy to try some things out if you have suggestions.

@add023 I think I solved this by setting batch_size to 2 (even though it’s larger, it worked for me for some reason). I also ran the model in parallel on my GPUs.


I see. Thanks for the reply. What do you mean by running the model in parallel? Is it both data parallelism and model parallelism?

Hey,
I am facing a similar issue with RuntimeError: CUDA out of memory. Could you please help me resolve it?

Try to lower the batch size to reduce the memory usage.
If that’s not possible, you could use torch.utils.checkpoint to trade compute for memory.

the pytorch preloaded structures which take some 0.5GB per process

It happens to me too. Even worse, in PyTorch 1.6 it costs 905 MB, and in PyTorch 1.7 it costs 961 MB. It seems like:

This is probably caused by the cuda runtime loading the kernel images.
Massive initial memory overhead GPU · Issue #12873 · pytorch/pytorch · GitHub
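A quick sketch to see this per-process overhead yourself (the exact number depends on the CUDA and PyTorch versions):

import torch

# Force CUDA initialization with a tiny allocation, then compare what PyTorch
# tracks against what nvidia-smi reports for this process.
x = torch.zeros(1, device="cuda")
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
# nvidia-smi will still show several hundred MiB for this process: that is the
# CUDA context and kernel images, not memory held by PyTorch's allocator.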

I got a similar problem. My data only includes around 1k images at 143*183 resolution. I set the batch size to 32. I'm using ResNet34. And I still got the error. I set 'Xmx to 2048m'. That's so weird.

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 8.00 GiB total capacity; 6.09 GiB already allocated; 39.75 MiB free; 6.28 GiB reserved in total by PyTorch)

The available 8GB might not be enough to run the model in this setup, as the error message indicates that <40MB are free on the device.
Did you make sure that the GPU is completely empty via e.g. nvidia-smi before starting the training?

Thanks for the response. I'm not familiar with this solution, so I searched Google. I found this one:

nvidia-smi
and select the PID of the process you want to kill
sudo kill -9 PID

Is this one correct?

And I also found another solution:

import gc
gc.collect()
torch.cuda.empty_cache()

Does this solution also work?

I just want to make sure that I won't lose important results, because it takes a long time to run once. Thanks for your response again!