I have been working on a server with a single A100 GPU. I installed NVIDIA driver 470.141.03, and nvidia-smi runs fine, showing the expected output, including CUDA Version: 11.4. I then created a Python venv and installed PyTorch with the following command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
When I launch Python and run torch.cuda.is_available() I get the following output:
/home/azureuser/michael/ml-toolkit-env/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
When I run torch.version.cuda I get 11.3, as expected.
Has anyone run across a similar issue in the past? Any suggestions on how to resolve this?
More detail: this is happening on a cloud VM in Azure.
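For completeness, here is the version-pairing logic as I understand it, as a small sketch (the helper below is hypothetical, not from my actual environment): the CUDA version reported by nvidia-smi is the newest runtime the driver supports, so it must be at least the runtime version the wheel targets.

```python
def cuda_ok(driver_cuda: str, wheel_cuda: str) -> bool:
    """True if the driver's supported CUDA version covers the wheel's runtime."""
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(driver_cuda) >= parse(wheel_cuda)

print(cuda_ok("11.4", "11.3"))  # True: a CUDA 11.4 driver can run a cu113 wheel
```

So on paper the driver (11.4) and the cu113 wheel should be compatible.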
Your setup seems to have trouble communicating with the NVIDIA driver, so make sure you can compile and execute e.g. the CUDA samples first.
It seems you’re correct: the CUDA samples won’t work either. For example, running deviceQuery fails with error code 3. According to this document, that would indicate the CUDA driver and runtime could not be initialized. Given that a compatible driver is installed, as confirmed by nvidia-smi, could this indicate a hardware failure? Or could it be something else?
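For reference, here is a small sketch decoding a few CUDA runtime error codes by name. The mapping is my reading of the CUDA 11 runtime headers (driver_types.h), so treat the values as an assumption rather than an authoritative list:

```python
# Assumed mapping of a few common CUDA *runtime* error codes (CUDA 11 headers).
CUDA_RUNTIME_ERRORS = {
    0: "cudaSuccess",
    3: "cudaErrorInitializationError",  # driver/runtime could not be initialized
    35: "cudaErrorInsufficientDriver",  # installed driver is older than the runtime
    100: "cudaErrorNoDevice",           # no CUDA-capable device was detected
}

def decode_cuda_error(code: int) -> str:
    return CUDA_RUNTIME_ERRORS.get(code, f"unrecognized error ({code})")

print(decode_cuda_error(3))  # cudaErrorInitializationError
```

Error 3 (cudaErrorInitializationError) matches what deviceQuery reports here.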
Thanks for the help!
I doubt it’s a hardware defect and would probably just reinstall the driver.
We have tried reinstalling the driver multiple times, and tried several versions (515, 520, 450); all result in the same failure. We are using this as our base image. It’s an official Azure Ubuntu 20.04 image that comes pre-packaged with many of the libraries required for HPC, and comes pre-installed with what is ostensibly a correct driver and CUDA combination. We have tried several driver versions, restoring the original image after each unsuccessful attempt. The error code is always identical.
Is there anything besides a driver reinstall we could try?
As it seems to be quite system-specific and unrelated to PyTorch, I would probably contact your system admin or the vendor who maintains these nodes.
Understood, thanks! I will post here what I discover after I work through this error.
We have resolved the issue. It turned out that having MIG (Multi-Instance GPU) mode enabled was the cause. We disabled it with the first command below and verified the new mode with the second:
sudo nvidia-smi -i 0 -mig 0
sudo nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
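If it helps anyone scripting this, here is a small sketch (a hypothetical helper, not part of nvidia-smi) that parses the CSV output of the verification command above and checks that every listed GPU reports MIG as Disabled:

```python
import csv
import io

def mig_disabled(query_output: str) -> bool:
    """Parse `nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv`
    output and return True only if every GPU row reports MIG mode 'Disabled'."""
    rows = list(csv.reader(io.StringIO(query_output.strip())))
    # rows[0] is the CSV header; data rows follow as [bus_id, mig_mode].
    return bool(rows[1:]) and all(r[1].strip() == "Disabled" for r in rows[1:])

sample = """pci.bus_id, mig.mode.current
00000001:00:00.0, Disabled"""
print(mig_disabled(sample))  # True
```

Note that after toggling MIG mode, a GPU reset or reboot may be required before the change takes effect.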
If anyone comes across a similar issue, this MAY help.