I am using my institute's GPU through an SSH server (pardon the terms, I am a newbie). I have been trying for a long time, but PyTorch's torch.cuda.is_available() returns False.
I have also run some checks: torch.cuda.is_available() returns False, while torch.backends.cudnn.enabled is True. I have tried cudatoolkit 10.2, 11.1, and 11.3, and I am still unable to get a working conda environment. From my research it seems the issue is a driver incompatibility, but I still cannot find a solution.
Any help is appreciated. I am tagging @ptrblck as I am in desperate need of help. The results here are for PyTorch 1.10.0 with cudatoolkit 10.1.243.
Your A100s need to use CUDA>=11.0, so the binaries with cudatoolkit==10.1.243 won’t work.
I also see that you have MIG enabled on the A100s. Is this on purpose? If so, mask the desired MIG instance via CUDA_VISIBLE_DEVICES and use it in your script, as multiple MIG instances cannot be used together.
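A minimal sketch of that masking step, assuming a placeholder MIG UUID (list the real ones on your node with `nvidia-smi -L`):

```python
import os

# Expose exactly one MIG instance to this process. The UUID below is a
# placeholder -- replace it with one reported by `nvidia-smi -L`.
# This must run before torch (or any CUDA library) is imported,
# otherwise it has no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# import torch  # import only after the environment is set
```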
Is this a new setup, i.e. did you just install or update new drivers?
The last error message claims the driver fails to initialize, which could come from a broken installation or an update without a restart etc.
Hey @ansh7xpex, I ran into a similar problem. You are probably installing the CPU-only version of PyTorch. You need to make sure that you are installing the PyTorch+CUDA version. When you check the PyTorch version, you should see something like ‘1.8.0+cu111’, indicating PyTorch + CUDA.
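A quick sketch of that check, using a hypothetical helper (not part of PyTorch) and illustrative version strings:

```python
# Interpret the suffix of a torch.__version__ string.
def build_flavor(version: str) -> str:
    if "+cu" in version:
        return "cuda"   # e.g. "1.8.0+cu111" -> CUDA build
    if "+cpu" in version:
        return "cpu"    # e.g. "1.10.0+cpu" -> CPU-only build
    return "unknown"    # some conda builds carry no local suffix

print(build_flavor("1.8.0+cu111"))  # cuda
print(build_flavor("1.10.0+cpu"))   # cpu
```

In practice you would pass `torch.__version__` instead of the literal strings.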
@ansh7xpex did you install it in a new conda environment with Python 3 and activate it? I also ran into some issues after running the pip command, but I figured out I needed to create another environment.
As you said earlier, you may have some compatibility issues with the driver, so you may need to upgrade or downgrade to another driver version. I have the latest driver version (496.13) with an RTX 3080, while I can see that you have 470.57.02. Give it a try and install another driver version.
You can also try ‘nvcc -V’ to check the CUDA version that you have. I have V11.5.50.
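If you prefer to run that check from Python, here is a small sketch; note that `nvcc -V` reports the locally installed toolkit's compiler version, which can differ from the driver's supported CUDA version shown by `nvidia-smi`:

```python
import shutil
import subprocess

def toolkit_version() -> str:
    """Return the `nvcc -V` output, or a note if no local toolkit is found."""
    if shutil.which("nvcc") is None:
        return "nvcc not found on PATH (no local CUDA toolkit)"
    return subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout

print(toolkit_version())
```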
@Mohammed, I created a new conda environment with python 3.8, activated it and ran the command.
I don’t think I can mess with the drivers because I have only a login and password and the GPU is maintained by the IT department at my institute.
For CUDA 11, the driver should be >= 450.80.02 on Linux or >= 456.38 on Windows.
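A minimal sketch of that comparison, using the minimum versions quoted above (`driver_ok` is a hypothetical helper, not an official API):

```python
# Compare a driver version string against the CUDA 11 minimums per platform.
def driver_ok(driver_version: str, platform: str = "linux") -> bool:
    minimums = {"linux": (450, 80, 2), "windows": (456, 38)}
    version = tuple(int(part) for part in driver_version.split("."))
    return version >= minimums[platform]

print(driver_ok("470.57.02"))           # True: new enough for CUDA 11
print(driver_ok("418.87.01", "linux"))  # False: too old
```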
As @ptrblck said, “Your A100s need to use CUDA>=11.0, so the binaries with cudatoolkit==10.1.243 won’t work.” You probably need to reach out to IT so that they can upgrade the driver for you.
I finally solved the problem, and it turned out not to be the drivers after all. My college senior gave me a piece of code to put at the beginning of the file I wanted to run. Here is the code:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
import torch  # needed for the torch.cuda check below

print(tf.config.list_physical_devices("GPU"))
print("GPU Available: " + str(torch.cuda.is_available()))  # To finally check if the GPUs are being detected.
Adding this piece of code helped me to run my model on GPU.
Could you please check if removing the os.environ calls would break your setup again?
If so, I would guess you might have exported CUDA_VISIBLE_DEVICES in your default environment and could check it with echo $CUDA_VISIBLE_DEVICES in the terminal.
That’s interesting. Did you check your environment via echo $CUDA_VISIBLE_DEVICES or export? And if so, do you see any CUDA-related env variables set to wrong values?
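The same check can be done from inside Python, which shows exactly what the process inherited from the shell (equivalent to `echo $CUDA_VISIBLE_DEVICES` or `export`):

```python
import os

# Print any CUDA-related environment variables inherited from the shell.
for name, value in sorted(os.environ.items()):
    if name.startswith("CUDA_"):
        print(f"{name}={value}")
```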
You should not need to set these variables via os.environ inside a script, and I would consider it quite flaky: if you set them too late (after the CUDA context is initialized), the system env variables would be used instead.