PyTorch suddenly stops recognising GPU

Hi,

I am using PyTorch through an Anaconda environment and something weird happens: while I'm working, or if I leave the machine for some time and come back, PyTorch stops recognizing the GPU. The only way to get it to recognize the GPU again is to reboot the machine.

Why does this happen?

You mean torch.cuda.device_count() returns 0? Can you confirm nvidia-smi still works correctly when that happens? And can you also check the value of the CUDA_VISIBLE_DEVICES env var?
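All three checks can be run from a single Python session; a minimal sketch, assuming PyTorch and the NVIDIA driver may or may not be present on the machine:

```python
import os
import subprocess

# Env var check: None (unset) or "" means all GPUs should be visible
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# Driver check outside of PyTorch: exit code 0 means nvidia-smi still works
try:
    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print("nvidia-smi exit code:", smi.returncode)
except FileNotFoundError:
    print("nvidia-smi not found on PATH")

# PyTorch check: a count of 0 reproduces the issue described above
try:
    import torch
    print("device_count:", torch.cuda.device_count())
except ImportError:
    print("PyTorch is not installed in this environment")
```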

Hi,

Yeah. torch.cuda.device_count() returns 0 and torch.cuda.current_device() returns the following:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1591914886554/work/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/work/lib/python3.8/site-packages/torch/cuda/__init__.py", line 330, in current_device
    _lazy_init()
  File "/home/user/anaconda3/envs/work/lib/python3.8/site-packages/torch/cuda/__init__.py", line 153, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /opt/conda/conda-bld/pytorch_1591914886554/work/aten/src/THC/THCGeneral.cpp:47

nvidia-smi works. For CUDA_VISIBLE_DEVICES, I get nothing.

This happens to me sometimes too. To fix it without rebooting, I reload the GPU driver module using:

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm

No idea why it happens, though.
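After reloading the module, a fresh interpreter should see the GPU again. A quick sanity check (a sketch; it only assumes PyTorch may be installed):

```python
import importlib.util

# Run in a *fresh* Python session after `sudo modprobe nvidia_uvm`
if importlib.util.find_spec("torch") is None:
    print("PyTorch is not installed in this environment")
else:
    import torch
    # True here, instead of error 999, means the reload worked
    print("CUDA available:", torch.cuda.is_available())
```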


This works! How did you find this solution? It's so weird, right? It suddenly stops working. I think some internal behaviour of PyTorch changes something.

Hmmm… since it worked after rebooting my laptop, I guessed it should also work by just reloading the GPU, so I searched online for how to reset the NVIDIA GPU driver.

Thanks for the solution. This has been bothering me for quite some time now. I’m sure they’ll fix this in later versions.

Are any other CUDA applications running fine, i.e. are you able to run some CUDA examples etc.?
I’m not sure if this is a PyTorch-related issue or rather a CUDA/NVIDIA driver issue.


I didn’t check that, unfortunately. I can check whether Keras is also unable to recognize the GPU.

I’ll also try running CUDA examples from within the environment and outside it.

Hello Flock!

Just to share my experience (with an old version of pytorch and an
old gpu):

I see something similar to this. If I launch a fresh python session
and run a pytorch script that uses cuda, then if I don’t use cuda
(or maybe just the python session) for a short-ish amount of time,
future use of cuda in that python session fails.

But I don’t have to reboot my machine or “reload the gpu” to get
it working again; I only have to exit and restart python.

I haven’t found any fix for it – I just live with it, restarting python as
necessary.

Here’s a post of mine with some related observations:

Best.

K. Frank

Can you please share the versions of PyTorch and CUDA you are using (and perhaps a GPU type)?
Also, are there any messages printed to the kernel log (can be checked by running dmesg) when this happens?

Hey Frank,

Thank you for sharing your experience.

I see something similar to this. If I launch a fresh python session
and run a pytorch script that uses cuda, then if I don’t use cuda
(or maybe just the python session) for a short-ish amount of time,
future use of cuda in that python session fails.

This is what happens, but unfortunately for me I either have to restart the machine or reload the GPU driver. Just restarting Python didn’t help. I even tried reloading the conda environment. It’s as if a switch went off and I have to physically switch it on again.

I use PyTorch through a conda environment.
PyTorch: 1.5.1
CUDA toolkit: 10.1.243

On my machine, I have CUDA 11 installed for an RTX 2070 Super GPU.

Does working from an Anaconda environment affect this? Because the environment won’t use the CUDA installed on the machine, but the one downloaded by Anaconda itself.

As you said, the cudatoolkit from the conda binaries will be used, and your local CUDA 11 installation will thus not be used.
What do you mean by “affect this”?

By “affect this” I was referring to PyTorch suddenly not recognising the GPU. What I wanted to ask is how to check which CUDA installation is causing the problem: the one installed with Anaconda (cudatoolkit) or the one installed locally (CUDA 11).

Thanks for the explanation. As said, the cudatoolkit (shipped via the binaries) will be used.
However, I doubt that CUDA is responsible for this behavior and would first look into potential hardware, driver, PSU issues.
You could check dmesg for any XID errors.
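For example, something like this would pull out any Xid lines from the kernel log (a sketch; reading dmesg may require sudo on some systems):

```python
import subprocess

# The NVIDIA driver reports GPU faults as "Xid" lines in the kernel log
out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in out.splitlines() if "Xid" in line]
print("\n".join(xid_lines) if xid_lines else "no Xid errors logged")
```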