Torch.cuda.is_available() returns False on ssh server with NVIDIA GPU

I am using my Institute GPU through ssh server (Pardon the terms, I am a newbie). I have been trying for so long but PyTorch torch.cuda.is_available() returns False.

Here is the output of nvidia-smi:

I have also run some commands: torch.cuda.is_available() returns False , while torch.backend.cudnn.enabled is True . I have tried with cudatoolkit 10.2 and cudatoolkit 11.1 , as well as cudatoolkit 11.3 also and I am still not being able to make a conda environment. After research I have come to know that the issue is due to driver incompatibility, but I still cannot find the solution.

Any help is appreciated. I am tagging @ptrblck as I am in a desperate need of help. The results here are for PyTorch 1.10.0 and cudatoolkit 10.1.243.

enter image description here

Here is conda list torch.

Your A100s need to use CUDA>=11.0, so the binaries with cudatoolkit==10.1.243 won’t work.
I also see that you’ve MIG enabled on the A100s. Is this on purpose? If so, mask the desired MIG instance via CUDA_VISIBLE_DEVICES and is it in your script as multiple MIG instances cannot be used together.

Thanks for the reply @ptrblck. I created a new environment and installed PyTorch using

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge

But it still shows torch.cuda.is_available() as false and gives a warning.

@smth Can you please help?

Is this a new setup, i.e. did you just install or update new drivers?
The last error message claims the driver fails to initialize, which could come from a broken installation or an update without a restart etc.

@ptrblck I just got an ID and password from my institute to login. All I’ve done is to try and install Pytorch using conda.

Hey @ansh7xpex I ran into a similar problem. You are probably installing the cpu version of pytorch. You need to make sure that you are installing the pytorch+cuda version. It seems that when you check for the pytorch version, you should expect somthing like ‘1.8.0+cu111’ indicating pytorch + cuda.
cuda

I could not get conda command to work because of some connection error when downloading some files. So I ended up downloading the pytorch+cuda whl version on my computer and then using the pip command. You can download the whl from here https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp39-cp39-win_amd64.whl

Here is the pip command: pip install torch-1.8.0+cu111-cp39-cp39-win_amd64.whl torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

You also need to create a new conda environment to avoid some installation errors.

I hope that helps.

@Mohammed thanks for the reply. I used this command:

pip install torch-1.8.0+cu111-cp39-cp39-win_amd64.whl torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

I do get the output for torch.__version__ you have posted but, the problem persists.

@ansh7xpex did you install it in a new conda environment with python 9 and activate it? I also ran into some issues after running the pip command, but I figured out I needed to create another environment.

As you said earlier, you may have some compatibility issues with the driver. So you may need to upgrade or downgrade into another driver version. I have the latest driver version (496.13) with RTX 3080. I can see that you have 470.57.02. Give it a try and install another driver version.

You also can try ‘nvcc -V’ to check the cuda version that you have. I have V11.5.50.

@Mohammed, I created a new conda environment with python 3.8, activated it and ran the command.
I don’t think I can mess with the drivers because I have only a login and password and the GPU is maintained by the IT department at my institute.

@ansh7xpex I see what you mean.

You can check the following link to know more about the driver requirements.
https://docs.nvidia.com/deploy/cuda-compatibility

For cuda 11, the driver should be >= 450.80.02 for linux or >=456.38 for windows.

As @ptrblck said, “Your A100s need to use CUDA>=11.0,so the binaries with cudatoolkit==10.1.243 won’t work.” You probably need to reach out to the IT so that they can upgrade the driver for you.

Thanks @Mohammed, I will reach out to them.

I finally solved the problem. The problem was not due to drivers or anything. My college senior gave me a piece of code to write at the beginning of the file that I wanted to run. Here is the code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.config.list_physical_devices("GPU"))

print("GPU Available: "+str(torch.cuda.is_available())) #To finally check if the GPUs are being detected.

Adding this piece of code helped me to run my model on GPU.

Could you please check if removing the os.environ calls would break your setup again?
If so, I would guess you might have exported CUDA_VISIBLE_DEVICES in your default environment and could check it with echo $CUDA_VISIBLE_DEVICES in the terminal.

Yes, @ptrblck, removing the os.environ calls breaks my setup again.

That’s interesting. Did you check your environment via echo $CUDA_VISIBLE_DEVICES or export? And if so, do you see any CUDA related env variables set to wrong values?
You should not need to set the os.environ inside a script, as I would consider it quite flaky if you are setting these env vars too late (and the system env vars would be used instead).

I typed export in the terminal and found two variables that were related to cuda (I’m a newbie, pardon if I am wrong)

declare -x LD_LIBRARY_PATH="/usr/local/cuda-11.4/lib64:/usr/local/cuda/lib

declare -x PATH="/home/ansh.arora/tmp/ENTER/bin:/home/ansh.arora/tmp/ENTER/condabin:/usr/local/cuda-11.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ansh.arora/.local/bin:/home/ansh.arora/bin