CUDA out of memory issue

hi,
When I try to train a model using PyTorch 1.0.0 with 8 GPUs, I get the following error:

```
RuntimeError: CUDA out of memory. Tried to allocate 30.38 MiB (GPU 0; 15.75 GiB total capacity; 7.41 GiB already allocated; 8.94 MiB free; 40.77 MiB cached)
```

It does not always happen: sometimes training runs successfully, and sometimes it throws the CUDA out of memory error.

It seems there is enough memory (15.75 GiB total capacity, only 7.41 GiB already allocated, and it only tried to allocate about 30 MiB).
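For what it's worth, you can also print the allocator's own counters right before the step that fails. A minimal sketch (the helper name `report_gpu_memory` is just for illustration):

```python
import torch

def report_gpu_memory(device=0):
    # Values in MiB; torch.cuda.memory_cached was later renamed memory_reserved.
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    cached = torch.cuda.memory_cached(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print("allocated: %.2f MiB | cached: %.2f MiB | peak allocated: %.2f MiB"
          % (allocated, cached, peak))

# Call it right before the forward/backward pass that triggers the OOM.
report_gpu_memory(0)
```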

Does anyone know how to solve this?

Are you sharing those GPUs? Is the runtime error consistent with what nvidia-smi reports?
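For example, to double-check that nothing else is holding memory on those GPUs, something along these lines works (just a quick sanity check, assuming nvidia-smi is on the PATH):

```python
import subprocess

# List every compute process currently holding GPU memory, so you can
# confirm the GPUs are not shared with another job.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"]
)
print(out.decode())
```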

No, the GPUs are not shared with any other task, and there are no other jobs running on them.

Can you provide more details about your training? What type of model? What type of GPUs?

We’re seeing similar issues internally (also involving distributed training). I’m not sure what the problem is yet, but we’re looking into it.

The GPUs are V100s with 16 GB of memory. The model is an internal one. Single-GPU training works fine.

I have also tried it on several different clusters. On another two clusters, with P100 and V100 GPUs respectively, training works fine.

I have also tried a smaller batch size, even though our GPU memory is enough for the normal batch size, but the smaller batch size also raised the CUDA out of memory error.

With a smaller input image size, however, training works fine.

Hi colesbury, have you solved the problem?

I am facing a similar issue:

```
CUDA out of memory. Tried to allocate 196.50 MiB (GPU 0; 15.75 GiB total capacity; 7.09 GiB already allocated; 20.62 MiB free; 72.48 MiB cached)
```

It looks like there is enough memory left, yet I get an OOM error. Is there any update on this?

Thanks!

Hi, I am facing the same issue:

```
RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 15.75 GiB total capacity; 6.25 GiB already allocated; 8.44 GiB free; 17.78 MiB cached)
```

Any progress on this? Thanks, guys!

Run torch.cuda.empty_cache() to clear the cached GPU memory, then restart the kernel and/or close and re-open it.
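Roughly something like this (a minimal sketch; it only returns cached-but-unallocated blocks to the driver, so you still need to drop references to any large tensors first):

```python
import gc
import torch

# Drop Python references to models/tensors you no longer need first,
# e.g. `del model, optimizer`; otherwise their memory stays "allocated".
gc.collect()

# Return the cached-but-unallocated blocks to the driver so they are
# no longer counted against this process.
torch.cuda.empty_cache()

print("allocated: %.2f MiB" % (torch.cuda.memory_allocated(0) / 1024 ** 2))
print("cached:    %.2f MiB" % (torch.cuda.memory_cached(0) / 1024 ** 2))
```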