PyTorch-v1.7.1+cu110 - CUDA initialization error

I’m using a p4d.24xlarge instance (NVIDIA A100) on AWS. CUDA and the drivers appear to be installed correctly, but torch.cuda fails to initialize. The instance was set up using the steps here.

Can anyone tell me what might be causing this? Here is more information about my setup. Thanks in advance!

Error

conda install pytorch torchvision cudatoolkit=11.0 -c pytorch
python -c 'import torch; print(torch.cuda.is_available())'
/home_shared/valentyn/.conda/envs/cuda112/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False

Installations
Ubuntu 18.04 LTS

$ uname -a
Linux worker-p4d-24xlarge-spot14 5.4.0-1037-aws #39~18.04.1-Ubuntu SMP Fri Jan 15 02:48:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

CUDA 11.2

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0

torch v1.7.1

>>> import torch; torch.__version__;
'1.7.1+cu110'

nvidia-smi

$ nvidia-smi
Thu Feb 25 16:25:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:10:1C.0 Off | 0 |
| N/A 44C P0 48W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB Off | 00000000:10:1D.0 Off | 0 |
| N/A 40C P0 44W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB Off | 00000000:20:1C.0 Off | 0 |
| N/A 42C P0 47W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB Off | 00000000:20:1D.0 Off | 0 |
| N/A 40C P0 45W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB Off | 00000000:90:1C.0 Off | 0 |
| N/A 42C P0 45W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB Off | 00000000:90:1D.0 Off | 0 |
| N/A 40C P0 44W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB Off | 00000000:A0:1C.0 Off | 0 |
| N/A 44C P0 46W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB Off | 00000000:A0:1D.0 Off | 0 |
| N/A 43C P0 45W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Data Center GPU Manager

$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:10:1C.0 |
| | Device UUID: GPU-ee16b34f-512a-99f3-3254-5d13653417eb |
+--------+----------------------------------------------------------------------+
| 1 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:10:1D.0 |
| | Device UUID: GPU-5b4c8387-ed2c-6d6b-5226-9a442a43556f |
+--------+----------------------------------------------------------------------+
| 2 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:20:1C.0 |
| | Device UUID: GPU-08ece2c6-fabb-7616-6970-40e5c4501cc0 |
+--------+----------------------------------------------------------------------+
| 3 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:20:1D.0 |
| | Device UUID: GPU-154ef609-926e-d809-89f6-46c643274c0e |
+--------+----------------------------------------------------------------------+
| 4 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:90:1C.0 |
| | Device UUID: GPU-4be0122e-75a0-6da1-bfcc-f6e814907f35 |
+--------+----------------------------------------------------------------------+
| 5 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:90:1D.0 |
| | Device UUID: GPU-c971220d-fba2-45c5-df64-f54fd8bf1a6c |
+--------+----------------------------------------------------------------------+
| 6 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:A0:1C.0 |
| | Device UUID: GPU-ddca000b-bdff-f303-233c-f92a46596a9a |
+--------+----------------------------------------------------------------------+
| 7 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:A0:1D.0 |
| | Device UUID: GPU-0a31572a-40f9-39cc-09bd-783b9ee3dde1 |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

nvidia packages installed

$ dpkg -l | grep -i nvidia
ii cuda-nsight-compute-11-2 11.2.1-1 amd64 NVIDIA Nsight Compute
ii cuda-nsight-systems-11-2 11.2.1-1 amd64 NVIDIA Nsight Systems
ii cuda-nvtx-11-2 11.2.67-1 amd64 NVIDIA Tools Extension
ii datacenter-gpu-manager 1:2.1.4 amd64 NVIDIA® Datacenter GPU Management Tools
ii libnvidia-cfg1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-460 460.39-0ubuntu0.18.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.3.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.3.3-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-460:amd64 460.39-0ubuntu0.18.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nsight-compute-2020.3.1 2020.3.1.3-1 amd64 NVIDIA Nsight Compute
ii nvidia-compute-utils-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.4.2-1 amd64 NVIDIA container runtime
ii nvidia-container-toolkit 1.4.2-1 amd64 NVIDIA container runtime hook
ii nvidia-dkms-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-460 460.39-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
ii nvidia-modprobe 460.32.03-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 440.82-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA binary Xorg driver

Thanks!

The error points towards a machine setup failure.
Could you try to compile the CUDA samples and run them?
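
If building the samples is not convenient, a lighter first check is to call the driver API directly and look at the raw return code, bypassing PyTorch entirely. The sketch below is only an illustration: it assumes the driver library is reachable as `libcuda.so.1`, and `probe_driver`, `describe`, and the `KNOWN_CODES` table are names made up for this example. The numeric codes are the CUresult values from the CUDA driver API (802 is the code in the warning above).

```python
import ctypes

# Small subset of CUresult codes from the CUDA driver API.
KNOWN_CODES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
}


def describe(code):
    """Map a CUresult integer to its symbolic name, if it is in the table."""
    return KNOWN_CODES.get(code, f"unrecognized CUresult {code}")


def probe_driver(libname="libcuda.so.1"):
    """Load the driver library and call cuInit(0).

    Returns (code, description). code is None if the library
    could not be loaded at all.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError as exc:
        return None, f"could not load {libname}: {exc}"
    code = lib.cuInit(0)  # cuInit takes a flags argument that must be 0
    return code, describe(code)


if __name__ == "__main__":
    print(probe_driver())
```

If this also reports 802 or 803, the problem sits in the driver or machine setup rather than in the PyTorch install.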

Thank you @ptrblck ,

As you mentioned, the CUDA setup itself had issues, which are now resolved.

Hello! Would it be possible to tell a bit more about what you did to resolve the previous CUDA setup issue? I’m experiencing exactly the same error, but temporarily don’t have sudo access to the machine. Thank you!

Hi Everyone,

@ptrblck
I’m facing this issue, but it behaves “randomly”. I run the machine and load different models. After some iterations, it throws this error:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0

Do you have any workaround? How can I fix this issue?

torch==1.7.1
cuda version: cuda-10-2

I would generally recommend updating to the latest PyTorch release and, if possible, the NVIDIA drivers as well.
If this error is raised randomly, it could point towards a setup issue where the GPU is dropped.
This could indicate either a broken driver or, e.g., overheating of the GPU, which will shut itself down to avoid damage. The output of dmesg might show additional XID error codes, which might be helpful for further debugging.
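
To make the dmesg check concrete, here is a rough sketch that pulls XID entries out of kernel log text. It assumes the usual `NVRM: Xid (<device>): <code>, ...` line layout written by the NVIDIA kernel module; `parse_xids` is a made-up helper name, and the sample line is illustrative only, not output from any machine in this thread.

```python
import re
import subprocess

# Matches lines such as:
#   NVRM: Xid (PCI:0000:10:1c): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def parse_xids(log_text):
    """Return a list of (device, xid_code) tuples found in log_text."""
    return [
        (m.group("device"), int(m.group("code")))
        for m in XID_PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    # Reading dmesg may require elevated privileges on some systems.
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    for device, code in parse_xids(result.stdout):
        print(f"{device}: Xid {code}")
```

The reported codes can then be looked up in NVIDIA's XID error documentation.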
