All of a sudden, I get the following error when running my PyTorch code:
cublas runtime error : an access to GPU memory space failed.
Occasionally I also get:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
I have two 2080 Ti GPUs in the machine. The error only happens when I run my code on the GPU with id 1 (the GPU my monitor is connected to). Could this be a hardware issue? I have been able to run the code on either GPU in the past.
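One way to narrow this down is to pin the run to the suspect card and make CUDA errors synchronous. This is a minimal sketch using the standard CUDA_VISIBLE_DEVICES and CUDA_LAUNCH_BLOCKING environment variables (the tensor shape is just an illustrative placeholder); both must be set before torch is imported:

```python
import os

# Expose only the suspect physical GPU (id 1); it then appears to
# PyTorch as cuda:0, so the run cannot silently fall back to GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# Make kernel launches synchronous, so the asynchronous
# "illegal memory access" error is reported at the line that caused it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    device = torch.device("cuda:0")  # physical GPU 1 under the mapping above
    x = torch.randn(4, 3, 224, 224, device=device)  # placeholder workload
    print(torch.cuda.memory_allocated(device))  # bytes currently allocated
else:
    print("CUDA/PyTorch not available in this environment")
```

With launch blocking enabled the traceback should point at the actual failing op rather than some later synchronization point, at the cost of slower execution.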
Could you check the GPU memory usage on this device with nvidia-smi and make sure you are not running out of memory?
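If it helps, the nvidia-smi check can also be done programmatically so you can log memory usage during the run. A small sketch, assuming nvidia-smi is on the PATH (the function name is just for illustration):

```python
import shutil
import subprocess

def gpu_memory_mb():
    """Return a list of (used_mb, total_mb) tuples, one per GPU,
    or [] if nvidia-smi is not available (e.g. no NVIDIA driver)."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = []
    for line in out.strip().splitlines():
        used, total = (int(v) for v in line.split(","))  # values in MiB
        stats.append((used, total))
    return stats

print(gpu_memory_mb())  # e.g. one (used, total) pair per GPU
```

Calling this inside the training loop (say, once per epoch) would show whether usage creeps up toward the 11 GB limit before the crash.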
I assume the code works fine on the first device?
Yes, the program works fine on the first device. It takes about 3 GB of memory on the GPU, and the graphics take only 119 MB (unfortunately, due to self-isolation and lockdown, I can't check whether the display looks OK). If I make the batch size very small (2), then the program runs, taking 1.5 GB of memory.
OK, so I started gradually increasing the batch size, and now all of a sudden I don't get the error (??). I even increased it past what I was using originally (it now occupies almost all of the 11 GB) and it works. I wonder if this is an issue caused by overheating? The card is the one on the bottom, so it runs about 20 °C hotter than the other one.
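If you suspect thermals, it may be worth logging temperatures alongside the run. A minimal sketch, again assuming nvidia-smi is on the PATH (the function name is made up for this example):

```python
import shutil
import subprocess

def gpu_temperatures_c():
    """Return GPU core temperatures in degrees C, one per device,
    or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(t) for t in out.split()]

print(gpu_temperatures_c())  # e.g. one temperature per GPU
```

If the bottom card climbs toward its throttle/shutdown limits under load while the top one stays cool, that would support the overheating theory; a persistent 20 °C gap between identical cards is consistent with the bottom card ingesting the top card's exhaust.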