I have been working on a server with a single A100 GPU. I installed NVIDIA driver 470.141.03, and nvidia-smi runs fine, showing the expected output, including CUDA Version: 11.4. I then created a Python venv and installed PyTorch from the cu113 wheels.
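The install command was the standard one from the pytorch.org selector, roughly the following (the exact package pins may have differed):

    pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113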
When I launch Python and call torch.cuda.is_available(), it returns False with the following warning.
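The full session, run from inside the venv:

    $ python
    >>> import torch
    >>> torch.cuda.is_available()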
    /home/azureuser/michael/ml-toolkit-env/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
      return torch._C._cuda_getDeviceCount() > 0
    False
When I run torch.version.cuda, I get 11.3, as expected.
Has anyone run across a similar issue in the past? Any suggestions on how to resolve this?
More detail: this is happening on a cloud VM in Azure.
It seems you’re correct: the CUDA samples don’t work either. For example, running deviceQuery fails with error code 3, which, according to this document, means the CUDA driver and runtime could not be initialized. Given that a compatible driver is installed, as confirmed by nvidia-smi, could this indicate a hardware failure? Or could it be something else?
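For reference, this is how we run deviceQuery (assuming the samples shipped with the toolkit under /usr/local/cuda; the path may differ on this image):

    cd /usr/local/cuda/samples/1_Utilities/deviceQuery
    sudo make
    ./deviceQuery    # first CUDA runtime call (cudaGetDeviceCount) comes back with error code 3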
We have tried reinstalling the driver multiple times, with several versions (515, 520, 450); every attempt fails in the same way. We are using this as our base image: an official Azure Ubuntu 20.04 image that comes pre-packaged with many of the libraries required for HPC and ships with what is ostensibly a correct driver and CUDA combination. We restored the VM to the original image after each unsuccessful attempt, and the error code is always identical.
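Each reinstall attempt looked roughly like this (we used Ubuntu's apt packages; the version number is just an example and varied per attempt):

    sudo apt-get purge -y 'nvidia-*'
    sudo apt-get update
    sudo apt-get install -y nvidia-driver-515    # also tried 520 and 450
    sudo reboot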
Is there anything besides a driver reinstall we could try?