Thanks ptrblck, granth_jain.
When I investigated dmesg,
# dmesg|grep "NVRM"
[ 9.976755] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 455.38 Thu Oct 22 06:06:59 UTC 2020
I noticed that when I run the torch methods that error out, I get the following:
[43613.854296] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
This was caused by an incompatibility between the NVIDIA driver and kernel 5.9. I downgraded from 5.9 to 5.8 and applied the fix to both computers, which resolved these errors.
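For reference, a quick way to confirm the kernel/driver combination after a change like this (a minimal sketch in Python; it just shells out to nvidia-smi, which is assumed to be on PATH):

import platform
import subprocess

# Running kernel release, e.g. a 5.8.x string after the downgrade
print("kernel:", platform.release())

# Driver version as reported by nvidia-smi
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("driver:", driver)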
However, my Quadro RTX 5000 machine is encountering another error:
$ python3 -c 'import torch; torch.randn(1).to(0)'
Traceback (most recent call last):
File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
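To narrow this down, here is the kind of slightly more verbose check I run (a minimal sketch; device index 0 is assumed, matching the one-liner above):

import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # does PyTorch see a usable driver at all?
print(torch.cuda.device_count())  # number of visible devices

# Moving a tensor to the device forces CUDA context creation,
# which is where the RuntimeError above is raised.
try:
    torch.randn(1).to(0)
except RuntimeError as e:
    print("CUDA context creation failed:", e)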
I verified that no process is using the GPU,
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 5000 On | 00000000:04:00.0 Off | Off |
| 33% 26C P8 6W / 230W | 1MiB / 16125MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
and the compute mode is Default, not Exclusive,
$ nvidia-smi -a | grep Compute
Compute Mode : Default
This is running inside a KVM hypervisor with VFIO passthrough, and I verified that the nvidia driver is attached to the GPU:
04:00.0 VGA compatible controller: NVIDIA Corporation TU104GL [Quadro RTX 5000] (rev a1)
Subsystem: Dell TU104GL [Quadro RTX 5000]
Kernel driver in use: nvidia
Kernel modules: nvidia
05:00.0 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
Subsystem: Dell TU104 HD Audio Controller
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
06:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
Subsystem: Dell TU104 USB 3.1 Host Controller
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
07:00.0 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
Subsystem: Dell TU104 USB Type-C UCSI Controller
I attempted to:
- Remove all nvidia-* packages and reinstall nvidia-driver (tried both 450.80 and 455.38) and nvidia-cuda-toolkit (11.0)
- Reinstall PyTorch for CUDA 11.0 using miniconda and pip.
The server is headless and no desktop environment is installed, so there should be no graphics-based processes using the GPU.
Do you know what is causing this issue? Can this be caused by VFIO, although everything seems to be in order? Thanks!
More tests
I downloaded and compiled the script to test CUDA functionality. The output shows error code 201 (invalid device context) for cuMemGetInfo:
$ ./cuda_check
Found 1 device(s).
Device: 0
Name: Quadro RTX 5000
Compute Capability: 7.5
Multiprocessors: 48
CUDA Cores: 3072
Concurrent threads: 49152
GPU clock: 1815 MHz
Memory clock: 7001 MHz
cMemGetInfo failed with error code 201: invalid device context
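To take PyTorch out of the picture entirely, a similar driver-API sequence can be reproduced from Python with ctypes (a rough sketch, assuming libcuda.so.1 is on the loader path and that the _v2 driver entry points are exported, as they are on recent drivers):

import ctypes

# Assumed: libcuda.so.1 ships with the driver and is on the loader path.
cuda = ctypes.CDLL("libcuda.so.1")

def check(err, call):
    if err != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"{call} failed with error code {err}")

check(cuda.cuInit(0), "cuInit")

dev = ctypes.c_int()
check(cuda.cuDeviceGet(ctypes.byref(dev), 0), "cuDeviceGet")

# Create a context on device 0; cuMemGetInfo below needs a current context.
ctx = ctypes.c_void_p()
check(cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev), "cuCtxCreate")

free = ctypes.c_size_t()
total = ctypes.c_size_t()
check(cuda.cuMemGetInfo_v2(ctypes.byref(free), ctypes.byref(total)), "cuMemGetInfo")
print(f"free {free.value} / total {total.value} bytes")

check(cuda.cuCtxDestroy_v2(ctx), "cuCtxDestroy")

If the problem is at the driver level, I'd expect this to fail at cuCtxCreate or cuMemGetInfo independently of PyTorch.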