torch.cuda.is_available() returns False on SSH server with NVIDIA GPU

ansh7xpex · November 2, 2021, 9:33pm

I am using my institute's GPU over SSH (pardon the terms, I am a newbie). I have been trying for a long time, but torch.cuda.is_available() still returns False.

Here is the output of nvidia-smi:

I have also run some commands: torch.cuda.is_available() returns False, while torch.backends.cudnn.enabled is True. I have tried cudatoolkit 10.2, 11.1, and 11.3, and I still cannot create a working conda environment. From what I have read, the issue is due to driver incompatibility, but I still cannot find a solution.
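The checks mentioned in this post can be gathered in one small script. This is only a minimal sketch: the helper name collect_torch_report is made up for illustration, and it reads nothing beyond attributes that PyTorch exposes. Note that torch.backends.cudnn.enabled is a settings flag that defaults to True, so it does not by itself show that a GPU is usable.

```python
def collect_torch_report():
    """Return a dict describing the installed PyTorch build (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "cudnn_enabled": torch.backends.cudnn.enabled,  # a flag, True by default
    }


for key, value in collect_torch_report().items():
    print(f"{key}: {value}")
```

If built_with_cuda prints None, the environment holds a CPU-only build and no driver change will make is_available() return True.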

Any help is appreciated. I am tagging @ptrblck as I am in desperate need of help. The results here are for PyTorch 1.10.0 and cudatoolkit 10.1.243.

Here is conda list torch.

ptrblck · November 3, 2021, 7:33am

Your A100s need to use CUDA>=11.0, so the binaries with cudatoolkit==10.1.243 won’t work.
I also see that you’ve MIG enabled on the A100s. Is this on purpose? If so, mask the desired MIG instance via CUDA_VISIBLE_DEVICES and use it in your script, as multiple MIG instances cannot be used together.
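Selecting a single MIG instance could look roughly like the sketch below. It assumes the instance UUIDs are taken from the text printed by nvidia-smi -L; the sample text and UUIDs are hypothetical placeholders, and parse_mig_uuids is an illustrative helper, not part of any library.

```python
import re


def parse_mig_uuids(nvidia_smi_list_output):
    """Extract MIG device UUIDs from text in the style of `nvidia-smi -L`."""
    return re.findall(r"\(UUID:\s*(MIG-[^)\s]+)\)", nvidia_smi_list_output)


# Hypothetical sample output; real device names and UUIDs will differ.
SAMPLE = """\
GPU 0: A100-SXM4-40GB (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.20gb Device 0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
  MIG 3g.20gb Device 1: (UUID: MIG-22222222-2222-2222-2222-222222222222)
"""

chosen = parse_mig_uuids(SAMPLE)[0]

# In a real script, apply the value before the first CUDA call, e.g.
#   os.environ["CUDA_VISIBLE_DEVICES"] = chosen
print(f"CUDA_VISIBLE_DEVICES={chosen}")
```

Only one UUID is selected here because, as noted above, multiple MIG instances cannot be used together.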

ansh7xpex · November 3, 2021, 8:44am

Thanks for the reply @ptrblck. I created a new environment and installed PyTorch using

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge

But it still shows torch.cuda.is_available() as false and gives a warning.

ansh7xpex · November 3, 2021, 1:08pm

@smth Can you please help?

ptrblck · November 3, 2021, 6:28pm

Is this a new setup, i.e. did you just install or update new drivers?
The last error message claims the driver fails to initialize, which could come from a broken installation or an update without a restart etc.
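One way to see whether the driver responds at all, without involving PyTorch, is to query nvidia-smi from Python. The sketch below is an assumption-laden example rather than an official check: query_driver is a made-up helper name, and it returns None whenever the tool is missing or exits with an error.

```python
import shutil
import subprocess


def query_driver():
    """Return one 'driver_version, name' string per GPU, or None if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
    except OSError:
        return None  # the tool could not be started
    if result.returncode != 0:
        return None  # the driver did not answer the query
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(query_driver())
```

A None result on a machine that should have GPUs points at the driver installation rather than at the PyTorch package.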

ansh7xpex · November 3, 2021, 6:32pm

@ptrblck I just got an ID and password from my institute to log in. All I’ve done is try to install PyTorch using conda.

Mohammed · November 4, 2021, 4:25am

Hey @ansh7xpex, I ran into a similar problem. You are probably installing the CPU version of PyTorch. You need to make sure that you are installing the PyTorch+CUDA version. When you check the PyTorch version, you should expect something like ‘1.8.0+cu111’, indicating PyTorch + CUDA.
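The version-string check described above can be sketched as follows. The helper local_build_tag is hypothetical and only splits the string; as far as I know, conda packages usually report a version without a suffix, so torch.version.cuda (None on CPU-only builds) is the more direct check when the suffix is absent.

```python
def local_build_tag(version):
    """Return the part after '+' in a version string, or None if there is none."""
    _, separator, tag = version.partition("+")
    return tag if separator else None


print(local_build_tag("1.8.0+cu111"))  # cu111
print(local_build_tag("1.10.0+cpu"))   # cpu
print(local_build_tag("1.10.0"))       # None
```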

I could not get conda command to work because of some connection error when downloading some files. So I ended up downloading the pytorch+cuda whl version on my computer and then using the pip command. You can download the whl from here https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp39-cp39-win_amd64.whl

Here is the pip command: pip install torch-1.8.0+cu111-cp39-cp39-win_amd64.whl torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

You also need to create a new conda environment to avoid some installation errors.

I hope that helps.

ansh7xpex · November 4, 2021, 10:45am

@Mohammed thanks for the reply. I used this command:

pip install torch-1.8.0+cu111-cp39-cp39-win_amd64.whl torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

I do get the torch.__version__ output you posted, but the problem persists.

Mohammed · November 4, 2021, 12:38pm

@ansh7xpex did you install it in a new conda environment with Python 3.9 and activate it? I also ran into some issues after running the pip command, but I figured out I needed to create another environment.

As you said earlier, you may have some compatibility issues with the driver, so you may need to upgrade or downgrade to another driver version. I have the latest driver version (496.13) with an RTX 3080. I can see that you have 470.57.02. Give it a try and install another driver version.

You also can try ‘nvcc -V’ to check the cuda version that you have. I have V11.5.50.
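For reference, the release number can be pulled out of the nvcc -V text as in the sketch below; parse_nvcc_release is an illustrative helper, and the sample line uses the version quoted above. Keep in mind that nvcc reports the locally installed CUDA toolkit, while the PyTorch pip and conda binaries ship their own CUDA runtime, so the two numbers need not match.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' part of `nvcc -V` text."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


SAMPLE = "Cuda compilation tools, release 11.5, V11.5.50"
print(parse_nvcc_release(SAMPLE))  # (11, 5)
```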

ansh7xpex · November 4, 2021, 12:41pm

@Mohammed, I created a new conda environment with python 3.8, activated it and ran the command.
I don’t think I can mess with the drivers because I have only a login and password and the GPU is maintained by the IT department at my institute.

Mohammed · November 4, 2021, 1:14pm

@ansh7xpex I see what you mean.

You can check the following link to know more about the driver requirements.
https://docs.nvidia.com/deploy/cuda-compatibility

For CUDA 11, the driver should be >= 450.80.02 on Linux or >= 456.38 on Windows.
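Comparing a driver version against these minimums is a plain component-wise comparison. The sketch below uses made-up helper names and the numbers quoted in this thread; the 440.33.01 value is a hypothetical example of an older driver.

```python
def parse_version(text):
    """Turn '470.57.02' into (470, 57, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(installed, minimum):
    """Return True if the installed driver is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(driver_meets_minimum("470.57.02", "450.80.02"))  # True
print(driver_meets_minimum("440.33.01", "450.80.02"))  # False
```

By this comparison, the 470.57.02 driver reported earlier in the thread already satisfies the Linux minimum for CUDA 11.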

As @ptrblck said, “Your A100s need to use CUDA>=11.0, so the binaries with cudatoolkit==10.1.243 won’t work.” You probably need to reach out to IT so that they can upgrade the driver for you.

ansh7xpex · November 4, 2021, 1:44pm

Thanks @Mohammed, I will reach out to them.

ansh7xpex · November 5, 2021, 8:16pm

I finally solved the problem. The problem was not due to drivers or anything. My college senior gave me a piece of code to write at the beginning of the file that I wanted to run. Here is the code:

import os
# Set these before any library initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
import torch  # needed for the torch.cuda check below
from tensorflow.python.client import device_lib

print(tf.config.list_physical_devices("GPU"))

print("GPU Available: " + str(torch.cuda.is_available()))  # Finally, check whether the GPUs are detected.

Adding this piece of code helped me to run my model on GPU.

ptrblck · November 5, 2021, 9:41pm

Could you please check if removing the os.environ calls would break your setup again?
If so, I would guess you might have exported CUDA_VISIBLE_DEVICES in your default environment and could check it with echo $CUDA_VISIBLE_DEVICES in the terminal.
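The same check can be done from inside Python. The sketch below is a rough equivalent of echo $CUDA_VISIBLE_DEVICES; describe_visible_devices is a hypothetical helper that distinguishes an unset variable (all GPUs visible) from one set to an empty string (no GPUs visible).

```python
import os


def describe_visible_devices(env):
    """Summarize what CUDA_VISIBLE_DEVICES in the given mapping implies."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible"
    if value.strip() == "":
        return "empty: no GPUs are visible"
    devices = [item.strip() for item in value.split(",")]
    return "restricted to: " + ", ".join(devices)


print(describe_visible_devices(os.environ))
```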

ansh7xpex · November 6, 2021, 10:39am

Yes, @ptrblck, removing the os.environ calls breaks my setup again.

ptrblck · November 6, 2021, 6:35pm

That’s interesting. Did you check your environment via echo $CUDA_VISIBLE_DEVICES or export? And if so, do you see any CUDA related env variables set to wrong values?
You should not need to set the os.environ inside a script, as I would consider it quite flaky if you are setting these env vars too late (and the system env vars would be used instead).
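An alternative to editing os.environ inside the script is to set the variables for the process at launch, which avoids the ordering problem. The sketch below does this from Python with a child process and is roughly what CUDA_VISIBLE_DEVICES=0 python script.py does in a shell; run_with_devices is a made-up helper, and the child here only echoes the variable instead of running a real training script.

```python
import os
import subprocess
import sys


def run_with_devices(devices):
    """Start a child Python process with CUDA_VISIBLE_DEVICES preset; return its output."""
    child_env = {
        **os.environ,
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
        "CUDA_VISIBLE_DEVICES": devices,
    }
    # Stand-in for a real script: the child just prints what it sees.
    child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(run_with_devices("0"))  # 0
```

Because the variables are in place before the child interpreter starts, no import order inside the script can make them arrive too late.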

ansh7xpex · November 6, 2021, 7:16pm

I typed export in the terminal and found two variables that were related to cuda (I’m a newbie, pardon if I am wrong)

declare -x LD_LIBRARY_PATH="/usr/local/cuda-11.4/lib64:/usr/local/cuda/lib

declare -x PATH="/home/ansh.arora/tmp/ENTER/bin:/home/ansh.arora/tmp/ENTER/condabin:/usr/local/cuda-11.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ansh.arora/.local/bin:/home/ansh.arora/bin
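Scanning the environment for CUDA-related entries can also be scripted instead of reading the export output by eye. This is a minimal sketch with hypothetical helper names; the sample path string is a fragment of the PATH value shown above.

```python
import os


def cuda_related_variables(env):
    """Return the variables whose names mention CUDA or NVIDIA."""
    return {
        name: value
        for name, value in env.items()
        if "CUDA" in name.upper() or "NVIDIA" in name.upper()
    }


def cuda_path_entries(path_value, separator=":"):
    """Return the entries of a PATH-style string that mention cuda."""
    return [entry for entry in path_value.split(separator) if "cuda" in entry.lower()]


SAMPLE_PATH = "/usr/local/cuda-11.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/cuda/bin:/usr/local/bin"

print(cuda_related_variables(os.environ))
print(cuda_path_entries(SAMPLE_PATH))
```

Note that PATH and LD_LIBRARY_PATH only mention CUDA inside their values, so they are covered by the second helper rather than the first.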