PyTorch not detecting CUDA - Azure NC96ads A100 v4 VM - Ubuntu 22.04

Hi,
I am having trouble installing and using PyTorch in a Conda environment on Ubuntu 22.04, running on a “Standard NC96ads A100 v4” VM in Azure. PyTorch is unable to detect the CUDA installation. Running “torch.cuda.is_available()” returns “False” together with a CUDA driver initialization warning. The details are below.

>>> torch.cuda.is_available()
/home/xyz/anaconda/envs/llm/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1670525541990/work/c10/cuda/CUDAFunctions.cpp:109.)  
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch._C._cuda_getDeviceCount()
0
>>> torch.version.cuda
'11.7'
>>> print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version')
11070 cuda compiled version

I have tried the following to make it work, but nothing helped. Any help would be great.

  1. Uninstalled CUDA and the NVIDIA drivers completely and installed them again.
    (I made sure to reboot the VM after every installation of CUDA and the NVIDIA drivers.)
  2. Tried multiple versions of PyTorch (1.5.0, 1.12.1, 1.13.1), uninstalling and reinstalling each with conda uninstall and conda install.

Conda command used:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Python Version: 3.9.10

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   34C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000002:00:00.0 Off |                  Off |
| N/A   34C    P0    43W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000003:00:00.0 Off |                  Off |
| N/A   35C    P0    45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000004:00:00.0 Off |                  Off |
| N/A   34C    P0    42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

sudo ubuntu-drivers devices
== /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000000c1-0003-0000-3130-444532304235/pci0003:00/0003:00:00.0 ==
modalias : pci:v000010DEd000020B5sv000010DEsd00001533bc03sc02i00
vendor   : NVIDIA Corporation
manual_install: True
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-525-open - distro non-free recommended
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-515-open - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

sudo apt list --installed | grep -i cuda

cuda-11-7/unknown, now 11.7.1-1 amd64 [installed]
cuda-cccl-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-command-line-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-compiler-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-cudart-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-cudart-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-cuobjdump-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-cupti-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-cupti-dev-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-cuxxfilt-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-demo-suite-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-documentation-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-driver-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-drivers-525/unknown,now 525.85.12-1 amd64 [installed,automatic]
cuda-drivers/unknown,now 525.85.12-1 amd64 [installed,automatic]
cuda-gdb-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-libraries-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-libraries-dev-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-memcheck-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nsight-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nsight-compute-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-nsight-systems-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-nvcc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvdisasm-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvml-dev-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvprof-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-nvprune-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvrtc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvrtc-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvtx-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvvp-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-runtime-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-sanitizer-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-toolkit-11-7-config-common/unknown,now 11.7.99-1 all [installed,automatic]
cuda-toolkit-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-toolkit-11-config-common/unknown,now 11.8.89-1 all [installed,automatic]
cuda-toolkit-config-common/unknown,now 12.0.146-1 all [installed,automatic]
cuda-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-visual-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
libcudart10.1/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-dev/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-doc/focal,now 10.1.243-3 all [installed,automatic]
nvidia-cuda-gdb/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-toolkit/focal,now 10.1.243-3 amd64 [installed]

Please help me resolve this issue.

To run CUDA in a virtual machine you would need to install NVIDIA GRID drivers. I don’t know your exact setup, but Microsoft explains the install steps in its documentation, and I’m unsure why the Azure image you are using doesn’t come with a preinstalled setup.

You have also installed an old CUDA 10.1 toolkit, which you should remove.
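To see which toolkit versions are installed side by side, the `apt list --installed | grep -i cuda` output from the question can be grouped by version. This is only a rough sketch: the function name `cuda_toolkit_versions` is made up for illustration, and skipping the driver and `config-common` packages is my own assumption (they seem to follow a different versioning scheme), not something apt or NVIDIA defines.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches lines such as:
#   cuda-nvcc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
_APT_LINE = re.compile(r"^([^/\s]+)/[^\s,]*,\s*now\s+(\d+)\.(\d+)")


def cuda_toolkit_versions(listing: str) -> Dict[str, List[str]]:
    """Group package names by the major.minor version reported by apt."""
    by_version = defaultdict(list)
    for line in listing.splitlines():
        match = _APT_LINE.match(line.strip())
        if match is None:
            continue  # not a package line
        name, major, minor = match.groups()
        # Assumption: driver and shared config packages are versioned
        # independently of the toolkit, so they are left out.
        if "drivers" in name or "config-common" in name:
            continue
        by_version[f"{major}.{minor}"].append(name)
    return dict(by_version)


if __name__ == "__main__":
    # A few lines copied from the listing in the question.
    sample = """\
cuda-11-7/unknown, now 11.7.1-1 amd64 [installed]
cuda-nvcc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-drivers-525/unknown,now 525.85.12-1 amd64 [installed,automatic]
nvidia-cuda-toolkit/focal,now 10.1.243-3 amd64 [installed]
"""
    for version, packages in sorted(cuda_toolkit_versions(sample).items()):
        print(version, packages)
```

More than one version key in the result suggests that several toolkits are installed at once, as with the 10.1 and 11.7 packages here.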

In any case, the error points to a driver issue and is most likely unrelated to PyTorch.
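One way to check the driver without involving PyTorch at all is to call the CUDA driver API directly. The sketch below loads the driver library through `ctypes` and calls `cuInit` and `cuDeviceGetCount`; the helper name `probe_cuda_driver` and the shape of the returned dict are made up for illustration, and `libcuda.so.1` is assumed to be the library name on your Linux install.

```python
import ctypes


def probe_cuda_driver(lib_name: str = "libcuda.so.1") -> dict:
    """Initialize the CUDA driver directly and report what it says."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The driver's user-space library could not be loaded at all.
        return {"loaded": False, "error": str(exc)}

    # cuInit returns a CUresult; 0 means CUDA_SUCCESS.
    init_status = libcuda.cuInit(0)
    if init_status != 0:
        return {"loaded": True, "cu_init": init_status, "device_count": None}

    count = ctypes.c_int(0)
    count_status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return {
        "loaded": True,
        "cu_init": init_status,
        "device_count": count.value if count_status == 0 else None,
    }


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If `cu_init` comes back non-zero here, the failure sits below PyTorch, in the driver setup itself.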

Hi @ptrblck ,

Thanks very much for your response. CUDA 10.1 was pulled in when I installed “nvidia-cuda-toolkit”. I could not find other versions of that package, and without it “nvcc” was not available.

I will try installing NVIDIA GRID drivers and will update you.

Hello @ptrblck ,

Thanks for the suggestion. The NVIDIA GRID driver update did not work when I tried to install it. I uninstalled nvidia-cuda-toolkit (version 10.1) and added “/usr/local/cuda/11-7/bin” to my .bashrc so that the “nvcc” command works. This resolved the issue.
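For anyone hitting the same thing, a quick way to confirm which “nvcc” the shell picks up, and whether an older copy still shadows it, is to list every match on PATH in lookup order. This is a minimal sketch; `find_on_path` is a made-up helper name, not part of any CUDA tooling.

```python
import os
from typing import List, Optional


def find_on_path(name: str = "nvcc", path: Optional[str] = None) -> List[str]:
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # The first entry is the one the shell would run.
    for location in find_on_path("nvcc"):
        print(location)
```

If more than one path is printed, the first one wins, so the order of entries in PATH matters.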

Thanks much once again.

That’s a bit weird, as GRID drivers are needed for VMs, but in any case it’s good to hear you’ve somehow solved the issue.

I checked the config: the GRID drivers were already installed in the VM. I had not verified this earlier!

Thanks for confirming! 🙂