I am using PyTorch through an Anaconda environment and something weird happens. While I am working, or if I leave the machine for some time and come back, PyTorch stops recognizing the GPU. The only way to get it to recognize the GPU again is to reboot the machine.
You mean torch.cuda.device_count() returns 0? Can you confirm that nvidia-smi still works correctly when that happens? And can you also check the value of the CUDA_VISIBLE_DEVICES env var?
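To confirm the symptom, a quick check along these lines (run in the Python session where it fails) shows what PyTorch currently sees, which you can compare against the nvidia-smi output:

    import os
    import torch

    # Report what PyTorch currently sees; compare against nvidia-smi.
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))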
This works. How did you find this solution? It's so weird, right? Suddenly it stops working. I think there's some internal functioning of PyTorch that changes something.
Are any other CUDA applications running fine, i.e., are you able to run some CUDA examples etc.? I'm not sure if this is a PyTorch-related issue or rather a CUDA/NVIDIA driver issue.
Just to share my experience (with an old version of PyTorch and an old GPU): I see something similar to this. If I launch a fresh Python session and run a PyTorch script that uses CUDA, then if I don't use CUDA (or maybe just the Python session) for a short-ish amount of time, future use of CUDA in that Python session fails.

But I don't have to reboot my machine or "reload the GPU" to get it working again; I only have to exit and restart Python. I haven't found any fix for it; I just live with it, restarting Python as necessary.

Here's a post of mine with some related observations:
Can you please share the versions of PyTorch and CUDA you are using (and perhaps the GPU type)?
Also, are there any messages printed to the kernel log (can be checked by running dmesg) when this happens?
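For reference, a snippet along these lines prints the requested info in one go (torch.version.cuda is the CUDA version the binaries were built with):

    import torch

    # Versions requested above.
    print("PyTorch:", torch.__version__)
    print("CUDA (build):", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))

Alternatively, python -m torch.utils.collect_env dumps a fuller environment report.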
I see something similar to this. If I launch a fresh Python session and run a PyTorch script that uses CUDA, then if I don't use CUDA (or maybe just the Python session) for a short-ish amount of time, future use of CUDA in that Python session fails.
This is what happens, but unfortunately for me I either have to restart the machine or reload the GPU; just restarting Python didn't help. I even tried reloading the conda environment. It's as if a switch went off and I have to physically switch it on again.
Does working from an Anaconda environment affect this? Because the environment won't use the CUDA installed on the machine but the one downloaded by Anaconda itself.
As you said, the cudatoolkit from the conda binaries will be used, so your local CUDA 11 installation will not be used.
What do you mean by "affect this"?
By "affect this" I was referring to PyTorch suddenly not recognizing the GPU. So what I wanted to ask is how to check which CUDA is causing the problem: the one that was installed with Anaconda (cudatoolkit) or the one that's locally installed (CUDA 11).
Thanks for the explanation. As said, the cudatoolkit (shipped via the binaries) will be used.
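You can verify this from Python: torch.version.cuda reports the toolkit the binaries ship with, and the local installation would only come into play when building PyTorch or custom CUDA extensions from source.

    import torch

    # CUDA toolkit version bundled with the PyTorch binaries,
    # not the locally installed CUDA 11.
    print(torch.version.cuda)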
However, I doubt that CUDA is responsible for this behavior and would first look into potential hardware, driver, or PSU issues. You could check dmesg for any Xid errors.
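Xid messages from the NVIDIA driver show up in the kernel log with an "NVRM: Xid" prefix. A minimal way to filter for them (note that reading dmesg may require root on some systems):

    import subprocess

    # Grep the kernel log for NVIDIA Xid error lines ("NVRM: Xid ...").
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    xid_lines = [line for line in log.splitlines() if "Xid" in line]
    print("\n".join(xid_lines) if xid_lines else "No Xid messages found.")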