All of a sudden I get the following error when running my PyTorch code:
cublas runtime error : an access to GPU memory space failed.
Occasionally I also get:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
I have two 2080Ti GPUs in the machine. The error only happens when I run my code on the GPU with id 1 (the one my monitor is also connected to). Could this be a hardware issue? I have been able to run the code on either GPU in the past.
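To get a trace that points at the actual failing op, I can rerun with synchronous kernel launches and only the suspect card exposed; a minimal sketch (standard CUDA/PyTorch environment variables, set before the first CUDA call):

```python
import os

# Must be set before the first CUDA call (ideally before importing torch):
# synchronous launches make the Python stack trace point at the failing kernel
# instead of a later, unrelated op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Expose only the suspect card (physical id 1); inside the process it then
# shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```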
Could you check the GPU memory on this device with nvidia-smi and make sure you are not running out of memory?
I assume the code works fine on the first device?
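If you want to script that check, something along these lines works (a minimal sketch; `gpu_memory_used_mib` is just an illustrative helper around nvidia-smi's CSV query interface, and the `sample_output` argument only exists so the parsing can be exercised without a GPU):

```python
import shutil
import subprocess

def gpu_memory_used_mib(sample_output=None):
    """Return used memory in MiB for each GPU, parsed from nvidia-smi.

    Pass `sample_output` to test the parsing on a machine without a GPU.
    """
    if sample_output is None:
        if shutil.which("nvidia-smi") is None:
            raise RuntimeError("nvidia-smi not found on PATH")
        sample_output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    # One line per GPU, e.g. "3072\n119\n" for two cards.
    return [int(line) for line in sample_output.splitlines() if line.strip()]

# Parsing example: two GPUs, 3072 MiB and 119 MiB in use.
print(gpu_memory_used_mib("3072\n119\n"))  # → [3072, 119]
```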
Yes, the program takes about 3 GB of GPU memory, and the graphics take just 119 MB (unfortunately, due to self-isolation and lockdown, I can't check whether the display looks OK). If I make the batch size very small (2), the program runs, taking 1.5 GB of memory.
OK, so I started gradually increasing the batch size, and now all of a sudden I don't get the error (??). I even increased it past what I was using originally (it now occupies almost all of the 11 GB) and it works. I wonder if this could be caused by overheating? That card is the bottom one, so it runs about 20°C hotter than the other.
The error shouldn't be raised by overheating; your system / GPU should instead shut down to prevent damage.
Maybe a zombie process was still holding some GPU memory and thus causing an OOM error.
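You can list which processes are holding GPU memory with something like this (a sketch; `compute_processes` is just an illustrative helper, and the exact field names for `--query-compute-apps` can vary by driver version, so check `nvidia-smi --help-query-compute-apps` on your machine):

```python
import shutil
import subprocess

QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,used_gpu_memory",
         "--format=csv,noheader,nounits"]

def compute_processes(sample_output=None):
    """Return (pid, used_MiB) for each process holding GPU memory.

    Pass `sample_output` to test the parsing without a GPU present.
    A PID listed here that no longer belongs to a live job is a
    candidate zombie to `kill`.
    """
    if sample_output is None:
        if shutil.which("nvidia-smi") is None:
            raise RuntimeError("nvidia-smi not found on PATH")
        sample_output = subprocess.check_output(QUERY, text=True)
    # One line per process, e.g. "12345, 3072"
    rows = [line.split(", ") for line in sample_output.splitlines() if line.strip()]
    return [(int(pid), int(mem)) for pid, mem in rows]
```

If nvidia-smi shows memory in use but no processes, `sudo fuser -v /dev/nvidia*` can reveal who still has the device open.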
Yeah, although the error persisted even after several reboots. It also still happens now, though more sporadically ¯\_(ツ)_/¯.