PyTorch suddenly stops recognising GPU

Hi,

I am using PyTorch through an Anaconda environment and something weird happens: while I'm working, or if I leave the machine for some time and come back, PyTorch stops recognizing the GPU. The only way to get it to recognize the GPU again is to reboot the machine.

Why does this happen?

You mean torch.cuda.device_count() returns 0? Can you confirm nvidia-smi still works correctly when that happens? And can you also check the value of the CUDA_VISIBLE_DEVICES env var?
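All three checks can be run from a single Python session; a minimal sketch, assuming PyTorch and the NVIDIA driver may or may not be present on the machine:

```python
import os
import subprocess

# Env var check: None (unset) or "" means all GPUs should be visible
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# Driver check outside of PyTorch: exit code 0 means nvidia-smi still works
try:
    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print("nvidia-smi exit code:", smi.returncode)
except FileNotFoundError:
    print("nvidia-smi not found on PATH")

# PyTorch check: a count of 0 reproduces the issue described above
try:
    import torch
    print("device_count:", torch.cuda.device_count())
except ImportError:
    print("PyTorch is not installed in this environment")
```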

Hi,

Yeah. torch.cuda.device_count() returns 0 and torch.cuda.current_device() returns the following:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1591914886554/work/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/work/lib/python3.8/site-packages/torch/cuda/__init__.py", line 330, in current_device
    _lazy_init()
  File "/home/user/anaconda3/envs/work/lib/python3.8/site-packages/torch/cuda/__init__.py", line 153, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /opt/conda/conda-bld/pytorch_1591914886554/work/aten/src/THC/THCGeneral.cpp:47

nvidia-smi works. For CUDA_VISIBLE_DEVICES, I get nothing.

This happens to me sometimes too. To fix it without rebooting, I reload the GPU driver module using:

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm

No idea why it happens, though.
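After reloading the module, a fresh interpreter should see the GPU again. A quick sanity check (a sketch; it only assumes PyTorch may be installed):

```python
import importlib.util

# Run in a *fresh* Python session after `sudo modprobe nvidia_uvm`
if importlib.util.find_spec("torch") is None:
    print("PyTorch is not installed in this environment")
else:
    import torch
    # True here, instead of error 999, means the reload worked
    print("CUDA available:", torch.cuda.is_available())
```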


This works! How did you find this solution? It's so weird, right? It suddenly stops working. I think some internal behaviour of PyTorch changes something.

Hmmm… since it worked after rebooting my laptop, I guessed it should also work by just reloading the GPU, so I searched online for how to reset the NVIDIA GPU driver.

Thanks for the solution. This has been bothering me for quite some time now. I’m sure they’ll fix this in later versions.

Are any other CUDA applications running fine, i.e. are you able to run some CUDA examples etc.?
I’m not sure if this is a PyTorch-related issue or rather a CUDA/NVIDIA driver issue.


I didn’t check that, unfortunately. I can check whether Keras is also unable to recognize the GPU.

I’ll also try running CUDA examples from within the environment and outside it.

Hello Flock!

Just to share my experience (with an old version of pytorch and an
old gpu):

I see something similar to this. If I launch a fresh python session
and run a pytorch script that uses cuda, then if I don’t use cuda
(or maybe just the python session) for a short-ish amount of time,
future use of cuda in that python session fails.

But I don’t have to reboot my machine or “reload the gpu” to get
it working again; I only have to exit and restart python.

I haven’t found any fix for it – I just live with it, restarting python as
necessary.

Here’s a post of mine with some related observations:

Best.

K. Frank

Can you please share the versions of PyTorch and CUDA you are using (and perhaps a GPU type)?
Also, are there any messages printed to the kernel log (can be checked by running dmesg) when this happens?

Hey Frank,

Thank you for sharing your experience.

I see something similar to this. If I launch a fresh python session
and run a pytorch script that uses cuda, then if I don’t use cuda
(or maybe just the python session) for a short-ish amount of time,
future use of cuda in that python session fails.

This is what happens, but unfortunately for me I either have to restart the machine or reload the GPU driver. Just restarting Python didn’t help. I even tried reloading the conda environment. It’s as if a switch went off and I have to physically switch it on again.

I use PyTorch through a conda environment.
PyTorch: 1.5.1
CUDA toolkit: 10.1.243

On my machine, I have CUDA 11 installed for an RTX 2070 Super GPU.

Does working from an Anaconda environment affect this? Because the environment won’t use the CUDA installed on the machine, but the one downloaded by Anaconda itself.

As you said, the cudatoolkit from the conda binaries will be used, and your local CUDA 11 installation will thus not be used.
What do you mean by “affect this”?

By “affect this” I was referring to PyTorch suddenly not recognising the GPU. What I wanted to ask is how to check which CUDA installation is causing the problem: the one installed with Anaconda (cudatoolkit) or the one installed locally (CUDA 11).

Thanks for the explanation. As said, the cudatoolkit (shipped via the binaries) will be used.
However, I doubt that CUDA is responsible for this behavior and would first look into potential hardware, driver, PSU issues.
You could check dmesg for any XID errors.
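For example, something like this would pull out any Xid lines from the kernel log (a sketch; reading dmesg may require sudo on some systems):

```python
import subprocess

# The NVIDIA driver reports GPU faults as "Xid" lines in the kernel log
out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in out.splitlines() if "Xid" in line]
print("\n".join(xid_lines) if xid_lines else "no Xid errors logged")
```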