I'm using a p4d.24xlarge instance (NVIDIA A100) on AWS. CUDA and the drivers appear to be installed correctly, but torch.cuda fails to initialize. The instance was set up using the steps here.
Can anyone tell me what might be causing this? More information about my setup is below. Thanks in advance!
Error
conda install pytorch torchvision cudatoolkit=11.0 -c pytorch
python -c 'import torch; print(torch.cuda.is_available())'
/home_shared/valentyn/.conda/envs/cuda112/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
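In case it helps narrow things down, here is a minimal sketch of a slightly fuller check than the one-liner above. The helper name `collect_cuda_report` is hypothetical (my own, not part of PyTorch); it relies only on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`, and it assumes that a failed CUDA initialization shows up as `is_available()` returning False, as in the output above.

```python
import importlib


def collect_cuda_report():
    """Gather basic PyTorch/CUDA facts without raising if torch is missing."""
    report = {"torch_installed": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not importable in this environment; nothing more to report
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU-only builds)
    report["built_with_cuda"] = torch.version.cuda
    # False when the CUDA runtime cannot be initialized
    report["cuda_available"] = torch.cuda.is_available()
    # Only query the device count when initialization succeeded
    report["device_count"] = (
        torch.cuda.device_count() if report["cuda_available"] else 0
    )
    return report


if __name__ == "__main__":
    for key, value in collect_cuda_report().items():
        print(f"{key}: {value}")
```

Comparing `built_with_cuda` against the "CUDA Version" that nvidia-smi reports should at least show whether the PyTorch build and the driver are meant to work together.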
Installations
Ubuntu 18.04 LTS
$ uname -a
Linux worker-p4d-24xlarge-spot14 5.4.0-1037-aws #39~18.04.1-Ubuntu SMP Fri Jan 15 02:48:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
CUDA 11.2
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0
torch v1.7.1
>>> import torch; torch.__version__;
'1.7.1+cu110'
nvidia-smi
$ nvidia-smi
Thu Feb 25 16:25:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:10:1C.0 Off |                    0 |
| N/A   44C    P0    48W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      Off  | 00000000:10:1D.0 Off |                    0 |
| N/A   40C    P0    44W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      Off  | 00000000:20:1C.0 Off |                    0 |
| N/A   42C    P0    47W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      Off  | 00000000:20:1D.0 Off |                    0 |
| N/A   40C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      Off  | 00000000:90:1C.0 Off |                    0 |
| N/A   42C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      Off  | 00000000:90:1D.0 Off |                    0 |
| N/A   40C    P0    44W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      Off  | 00000000:A0:1C.0 Off |                    0 |
| N/A   44C    P0    46W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      Off  | 00000000:A0:1D.0 Off |                    0 |
| N/A   43C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Data Center GPU Manager
$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:10:1C.0                                         |
|        | Device UUID: GPU-ee16b34f-512a-99f3-3254-5d13653417eb                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:10:1D.0                                         |
|        | Device UUID: GPU-5b4c8387-ed2c-6d6b-5226-9a442a43556f                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:20:1C.0                                         |
|        | Device UUID: GPU-08ece2c6-fabb-7616-6970-40e5c4501cc0                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:20:1D.0                                         |
|        | Device UUID: GPU-154ef609-926e-d809-89f6-46c643274c0e                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:1C.0                                         |
|        | Device UUID: GPU-4be0122e-75a0-6da1-bfcc-f6e814907f35                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:1D.0                                         |
|        | Device UUID: GPU-c971220d-fba2-45c5-df64-f54fd8bf1a6c                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:A0:1C.0                                         |
|        | Device UUID: GPU-ddca000b-bdff-f303-233c-f92a46596a9a                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:A0:1D.0                                         |
|        | Device UUID: GPU-0a31572a-40f9-39cc-09bd-783b9ee3dde1                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
nvidia packages installed
$ dpkg -l | grep -i nvidia
ii cuda-nsight-compute-11-2 11.2.1-1 amd64 NVIDIA Nsight Compute
ii cuda-nsight-systems-11-2 11.2.1-1 amd64 NVIDIA Nsight Systems
ii cuda-nvtx-11-2 11.2.67-1 amd64 NVIDIA Tools Extension
ii datacenter-gpu-manager 1:2.1.4 amd64 NVIDIA® Datacenter GPU Management Tools
ii libnvidia-cfg1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-460 460.39-0ubuntu0.18.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.3.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.3.3-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-460:amd64 460.39-0ubuntu0.18.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-460:amd64 460.39-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nsight-compute-2020.3.1 2020.3.1.3-1 amd64 NVIDIA Nsight Compute
ii nvidia-compute-utils-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.4.2-1 amd64 NVIDIA container runtime
ii nvidia-container-toolkit 1.4.2-1 amd64 NVIDIA container runtime hook
ii nvidia-dkms-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-460 460.39-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
ii nvidia-modprobe 460.32.03-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 440.82-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-460 460.39-0ubuntu0.18.04.1 amd64 NVIDIA binary Xorg driver
Thanks!