CUDA out of memory issue

hi,
When I try to train a model using PyTorch 1.0.0 with 8 GPUs, I get the following error:

```
RuntimeError: CUDA out of memory. Tried to allocate 30.38 MiB (GPU 0; 15.75 GiB total capacity; 7.41 GiB already allocated; 8.94 MiB free; 40.77 MiB cached)
```

It does not always happen: sometimes training runs successfully, and sometimes it throws the CUDA out of memory error.

It seems there is enough memory (15.75 GiB total capacity, only 7.41 GiB already allocated, and it only tried to allocate about 30 MiB).
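For what it's worth, you can also print the allocator's own counters right before the step that fails. A minimal sketch (the helper name `report_gpu_memory` is just for illustration):

```python
import torch

def report_gpu_memory(device=0):
    # Values in MiB; torch.cuda.memory_cached was later renamed memory_reserved.
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    cached = torch.cuda.memory_cached(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print("allocated: %.2f MiB | cached: %.2f MiB | peak allocated: %.2f MiB"
          % (allocated, cached, peak))

# Call it right before the forward/backward pass that triggers the OOM.
report_gpu_memory(0)
```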

Does anyone know how to solve this?

Are you sharing those GPUs? Is the runtime error consistent with what nvidia-smi reports?
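For example, to double-check that nothing else is holding memory on those GPUs, something along these lines works (just a quick sanity check, assuming nvidia-smi is on the PATH):

```python
import subprocess

# List every compute process currently holding GPU memory, so you can
# confirm the GPUs are not shared with another job.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"]
)
print(out.decode())
```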

No, the GPUs are not shared with any other task, and there are no other jobs running on them.

Can you provide more details about your training? What type of model? What type of GPUs?

We’re seeing similar issues internally (also involving distributed training). I’m not sure what the problem is yet, but we’re looking into it.

The GPUs are V100s with 16 GB of memory. The model is an internal one. Single-GPU training works fine.

I have also tried it on several different clusters. On another two clusters, with P100 and V100 GPUs respectively, training works fine.

I have also tried a smaller batch size, even though our GPU memory is enough for the normal batch size, but the smaller batch size also raised the CUDA out of memory error.

With a smaller input image size, however, training works fine.

Hi colesbury, have you solved the problem?

I am facing a similar issue:

```
CUDA out of memory. Tried to allocate 196.50 MiB (GPU 0; 15.75 GiB total capacity; 7.09 GiB already allocated; 20.62 MiB free; 72.48 MiB cached)
```

It looks like there is enough memory left, yet I get an OOM error. Is there any update on this?

Thanks!

Hi, I am facing the same issue:

```
RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 15.75 GiB total capacity; 6.25 GiB already allocated; 8.44 GiB free; 17.78 MiB cached)
```

Any progress on this? Thanks, guys!

Run torch.cuda.empty_cache() to clear the cached GPU memory, then restart the kernel and/or close and re-open it.
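Roughly something like this (a minimal sketch; it only returns cached-but-unallocated blocks to the driver, so you still need to drop references to any large tensors first):

```python
import gc
import torch

# Drop Python references to models/tensors you no longer need first,
# e.g. `del model, optimizer`; otherwise their memory stays "allocated".
gc.collect()

# Return the cached-but-unallocated blocks to the driver so they are
# no longer counted against this process.
torch.cuda.empty_cache()

print("allocated: %.2f MiB" % (torch.cuda.memory_allocated(0) / 1024 ** 2))
print("cached:    %.2f MiB" % (torch.cuda.memory_cached(0) / 1024 ** 2))
```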