All of a sudden I get the following error when running my PyTorch code:
cublas runtime error : an access to GPU memory space failed.
Occasionally I also get:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
I have two 2080Ti GPUs in the machine. The error only happens when I run my code on the GPU with id 1 (the one my monitor is also connected to). Could this be a hardware issue? I have been able to run the code on either GPU in the past.
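To get a trace that points at the actual failing op, I can rerun with synchronous kernel launches and only the suspect card exposed; a minimal sketch (standard CUDA/PyTorch environment variables, set before the first CUDA call):

```python
import os

# Must be set before the first CUDA call (ideally before importing torch):
# synchronous launches make the Python stack trace point at the failing kernel
# instead of a later, unrelated op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Expose only the suspect card (physical id 1); inside the process it then
# shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```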
Could you check the GPU memory on this device with nvidia-smi and make sure you are not running out of memory?
I assume the code works fine on the first device?
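If you want to script that check, something along these lines works (a minimal sketch; `gpu_memory_used_mib` is just an illustrative helper around nvidia-smi's CSV query interface, and the `sample_output` argument only exists so the parsing can be exercised without a GPU):

```python
import shutil
import subprocess

def gpu_memory_used_mib(sample_output=None):
    """Return used memory in MiB for each GPU, parsed from nvidia-smi.

    Pass `sample_output` to test the parsing on a machine without a GPU.
    """
    if sample_output is None:
        if shutil.which("nvidia-smi") is None:
            raise RuntimeError("nvidia-smi not found on PATH")
        sample_output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    # One line per GPU, e.g. "3072\n119\n" for two cards.
    return [int(line) for line in sample_output.splitlines() if line.strip()]

# Parsing example: two GPUs, 3072 MiB and 119 MiB in use.
print(gpu_memory_used_mib("3072\n119\n"))  # → [3072, 119]
```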
Yes, the program takes about 3 GB of GPU memory, and the graphics take just 119 MB (unfortunately, due to self-isolation and lockdown, I can't check whether the display looks OK). If I make the batch size very small (2), the program runs, taking 1.5 GB of memory.
OK, so I started gradually increasing the batch size, and now all of a sudden I don't get the error (??). I even increased it past what I was using originally (it now occupies almost all of the 11 GB) and it works. I wonder if this could be caused by overheating? That card is the bottom one, so it runs about 20°C hotter than the other.
The error shouldn't be raised by overheating; your system / GPU should instead shut down to prevent damage.
Maybe a zombie process was still holding some GPU memory and thus causing an OOM error.
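You can list which processes are holding GPU memory with something like this (a sketch; `compute_processes` is just an illustrative helper, and the exact field names for `--query-compute-apps` can vary by driver version, so check `nvidia-smi --help-query-compute-apps` on your machine):

```python
import shutil
import subprocess

QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,used_gpu_memory",
         "--format=csv,noheader,nounits"]

def compute_processes(sample_output=None):
    """Return (pid, used_MiB) for each process holding GPU memory.

    Pass `sample_output` to test the parsing without a GPU present.
    A PID listed here that no longer belongs to a live job is a
    candidate zombie to `kill`.
    """
    if sample_output is None:
        if shutil.which("nvidia-smi") is None:
            raise RuntimeError("nvidia-smi not found on PATH")
        sample_output = subprocess.check_output(QUERY, text=True)
    # One line per process, e.g. "12345, 3072"
    rows = [line.split(", ") for line in sample_output.splitlines() if line.strip()]
    return [(int(pid), int(mem)) for pid, mem in rows]
```

If nvidia-smi shows memory in use but no processes, `sudo fuser -v /dev/nvidia*` can reveal who still has the device open.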
Yeah, although the error persisted even after several reboots. It also still happens now, though more sporadically ¯\_(ツ)_/¯.