Hi,
I am having trouble installing and using PyTorch in a Conda environment on an Ubuntu 20.04 VM of size “Standard NC96ads A100 v4” in Azure Cloud. PyTorch is unable to detect the installed CUDA. Running “torch.cuda.is_available()” returns “False” along with a “CUDA driver initialization failed” warning. The details are below.
>>> torch.cuda.is_available()
/home/xyz/anaconda/envs/llm/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1670525541990/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch._C._cuda_getDeviceCount()
0
>>> torch.version.cuda
'11.7'
>>> print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version')
11070 cuda compiled version
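The compiled-version integer 11070 encodes the CUDA version PyTorch was built against as major * 1000 + minor * 10, so it decodes to 11.7, matching torch.version.cuda. Below is a minimal sketch of that decoding, plus a simple check that the driver's maximum supported CUDA version (“CUDA Version: 12.0” in the nvidia-smi output below) is at least the build's version. The helper names are hypothetical, not part of PyTorch.

```python
from __future__ import annotations


def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Decode a CUDA version integer such as 11070 into (major, minor).

    CUDA encodes versions as major * 1000 + minor * 10.
    """
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def driver_is_new_enough(build: tuple[int, int], driver_max: tuple[int, int]) -> bool:
    """Return True if the driver's reported CUDA version is >= the build's version.

    Tuples compare element by element, so (12, 0) >= (11, 7) is True.
    """
    return driver_max >= build


if __name__ == "__main__":
    build = decode_cuda_version(11070)  # value from torch._C._cuda_getCompiledVersion()
    driver = (12, 0)                    # "CUDA Version: 12.0" reported by nvidia-smi
    print(f"built for CUDA {build[0]}.{build[1]}")
    print("driver new enough:", driver_is_new_enough(build, driver))
```

With these inputs the check passes, which suggests a plain driver/toolkit version mismatch is not what is blocking initialization here.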
I have tried the following to make it work, but nothing helped. Any help would be great.
- Completely uninstalled CUDA and the NVIDIA drivers and installed them again. (I made sure to reboot the VM after every installation of CUDA and the NVIDIA drivers.)
- Tried multiple PyTorch versions (1.5.0, 1.12.1, 1.13.1), uninstalling and reinstalling each one with conda uninstall and conda install.
Conda command used:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Python Version: 3.9.10
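The warning above points at a python3.10 site-packages directory inside the llm environment, while the Python version listed here is 3.9.10, so it may be worth confirming which interpreter and which PyTorch build are actually in use. A minimal sketch, assuming it is run with the same python as the failing session; collect_diagnostics is a hypothetical helper name, and only public torch attributes are used.

```python
import os
import sys


def collect_diagnostics() -> dict:
    """Gather interpreter and PyTorch/CUDA facts in one place.

    Works even when torch is missing, so it can be run from any environment.
    """
    info = {
        "python": sys.version.split()[0],  # interpreter actually running this code
        "executable": sys.executable,       # shows which conda env is active
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_diagnostics().items():
        print(f"{key}: {value}")
```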
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   34C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000002:00:00.0 Off |                  Off |
| N/A   34C    P0    43W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000003:00:00.0 Off |                  Off |
| N/A   35C    P0    45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000004:00:00.0 Off |                  Off |
| N/A   34C    P0    42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
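The “MIG M.” column shows MIG mode Disabled on GPU 0 but Enabled on GPUs 1 to 3, while the MIG devices section reports no MIG devices. A small sketch for pulling just that column, assuming nvidia-smi's --query-gpu interface with the index and mig.mode.current fields; parse_mig_modes and query_mig_modes are hypothetical helper names, and the sample text only mirrors the table above.

```python
from __future__ import annotations

import subprocess


def parse_mig_modes(csv_text: str) -> dict[int, str]:
    """Map GPU index -> current MIG mode from nvidia-smi CSV output.

    Expects lines like "0, Disabled", as produced by
    nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
    """
    modes: dict[int, str] = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_mig_modes() -> dict[int, str]:
    """Run nvidia-smi and parse its MIG-mode column (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_mig_modes(out)


if __name__ == "__main__":
    # Sample text mirroring the "MIG M." column in the table above.
    sample = "0, Disabled\n1, Enabled\n2, Enabled\n3, Enabled\n"
    modes = parse_mig_modes(sample)
    print("GPUs with MIG enabled:", [i for i, m in modes.items() if m == "Enabled"])
```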
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
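The system toolkit reports release 11.7, the same version as torch.version.cuda above. A minimal sketch for checking that automatically; the function names are hypothetical. Keep in mind that PyTorch binaries installed from conda bring their own CUDA runtime, so the system toolkit version mainly matters for compiling extensions and this comparison is informational.

```python
from __future__ import annotations

import re


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' line of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' line found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def matches_torch_build(nvcc_output: str, torch_cuda: str) -> bool:
    """Compare the system toolkit version with a string such as torch.version.cuda."""
    major, minor = (int(part) for part in torch_cuda.split(".")[:2])
    return parse_nvcc_release(nvcc_output) == (major, minor)


if __name__ == "__main__":
    text = "Cuda compilation tools, release 11.7, V11.7.99"  # line copied from above
    print(parse_nvcc_release(text))
    print(matches_torch_build(text, "11.7"))
```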
sudo ubuntu-drivers devices
== /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000000c1-0003-0000-3130-444532304235/pci0003:00/0003:00:00.0 ==
modalias : pci:v000010DEd000020B5sv000010DEsd00001533bc03sc02i00
vendor   : NVIDIA Corporation
manual_install: True
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-525-open - distro non-free recommended
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-515-open - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
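The modalias line encodes the PCI IDs of the device that ubuntu-drivers matched: vendor 10DE is NVIDIA, and class 03 / subclass 02 is a 3D controller. A minimal sketch that decodes the fields, assuming the standard v/d/sv/sd/bc/sc/i modalias layout; decode_modalias is a hypothetical helper.

```python
from __future__ import annotations

import re

# Assumed PCI modalias layout:
# v<vendor>d<device>sv<subvendor>sd<subdevice>bc<class>sc<subclass>i<prog-if>
MODALIAS_RE = re.compile(
    r"pci:v([0-9A-F]{8})d([0-9A-F]{8})sv([0-9A-F]{8})sd([0-9A-F]{8})"
    r"bc([0-9A-F]{2})sc([0-9A-F]{2})i([0-9A-F]{2})"
)


def decode_modalias(modalias: str) -> dict[str, int]:
    """Split a PCI modalias string into its numeric ID fields."""
    match = MODALIAS_RE.fullmatch(modalias.strip())
    if match is None:
        raise ValueError(f"not a PCI modalias: {modalias!r}")
    keys = ("vendor", "device", "subvendor", "subdevice", "class", "subclass", "prog_if")
    return {key: int(value, 16) for key, value in zip(keys, match.groups())}


if __name__ == "__main__":
    ids = decode_modalias("pci:v000010DEd000020B5sv000010DEsd00001533bc03sc02i00")
    print(f"vendor={ids['vendor']:04x} device={ids['device']:04x} "
          f"class={ids['class']:02x} subclass={ids['subclass']:02x}")
```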
cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
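When recording the OS in a bug report, the release fields can also be read programmatically, for example VERSION_ID (20.04 here). A minimal sketch, assuming the shell-style KEY=value format shown above; parse_os_release is a hypothetical helper. On Python 3.10 and later, platform.freedesktop_os_release() returns the same mapping directly.

```python
from __future__ import annotations

import os
import shlex


def parse_os_release(text: str) -> dict[str, str]:
    """Parse /etc/os-release style KEY=value lines into a dict.

    Values may be wrapped in shell-style quotes, which shlex removes.
    """
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        try:
            parts = shlex.split(raw)
        except ValueError:  # unbalanced quotes: keep the raw text
            parts = [raw]
        fields[key] = parts[0] if parts else ""
    return fields


if __name__ == "__main__":
    path = "/etc/os-release"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            info = parse_os_release(handle.read())
        print(info.get("PRETTY_NAME"), info.get("VERSION_ID"))
```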
sudo apt list --installed | grep -i cuda
cuda-11-7/unknown,now 11.7.1-1 amd64 [installed]
cuda-cccl-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-command-line-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-compiler-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-cudart-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-cudart-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-cuobjdump-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-cupti-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-cupti-dev-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-cuxxfilt-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-demo-suite-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-documentation-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-driver-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-drivers-525/unknown,now 525.85.12-1 amd64 [installed,automatic]
cuda-drivers/unknown,now 525.85.12-1 amd64 [installed,automatic]
cuda-gdb-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-libraries-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-libraries-dev-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-memcheck-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nsight-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nsight-compute-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-nsight-systems-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-nvcc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvdisasm-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvml-dev-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvprof-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-nvprune-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvrtc-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvrtc-dev-11-7/unknown,now 11.7.99-1 amd64 [installed,automatic]
cuda-nvtx-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-nvvp-11-7/unknown,now 11.7.101-1 amd64 [installed,automatic]
cuda-runtime-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-sanitizer-11-7/unknown,now 11.7.91-1 amd64 [installed,automatic]
cuda-toolkit-11-7-config-common/unknown,now 11.7.99-1 all [installed,automatic]
cuda-toolkit-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-toolkit-11-config-common/unknown,now 11.8.89-1 all [installed,automatic]
cuda-toolkit-config-common/unknown,now 12.0.146-1 all [installed,automatic]
cuda-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
cuda-visual-tools-11-7/unknown,now 11.7.1-1 amd64 [installed,automatic]
libcudart10.1/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-dev/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-doc/focal,now 10.1.243-3 all [installed,automatic]
nvidia-cuda-gdb/focal,now 10.1.243-3 amd64 [installed,automatic]
nvidia-cuda-toolkit/focal,now 10.1.243-3 amd64 [installed]
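The list mixes NVIDIA's cuda-*-11-7 packages with Ubuntu's own nvidia-cuda-toolkit 10.1 packages. A minimal sketch that groups apt list --installed lines by the major.minor part of their version so that such mixes stand out; the regex assumes the name/suite,now VERSION ARCH [flags] layout shown above, and group_by_version is a hypothetical helper.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Assumed layout of `apt list --installed` lines: name/suite,now VERSION ARCH [flags]
APT_LINE_RE = re.compile(r"^(?P<name>[^/\s]+)/\S*,\s*now (?P<version>\S+) (?P<arch>\S+)")


def group_by_version(apt_output: str) -> dict[str, list[str]]:
    """Group package names by the major.minor prefix of their version string."""
    groups = defaultdict(list)
    for line in apt_output.splitlines():
        match = APT_LINE_RE.match(line.strip())
        if match is None:  # e.g. the "Listing... Done" header
            continue
        major_minor = ".".join(match["version"].split(".")[:2])
        groups[major_minor].append(match["name"])
    return dict(groups)


if __name__ == "__main__":
    sample = "\n".join([
        "cuda-11-7/unknown,now 11.7.1-1 amd64 [installed]",
        "libcudart10.1/focal,now 10.1.243-3 amd64 [installed,automatic]",
        "nvidia-cuda-toolkit/focal,now 10.1.243-3 amd64 [installed]",
    ])
    for version, names in sorted(group_by_version(sample).items()):
        print(version, names)
```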
Please help me resolve this issue.