No process using GPU, but `CUDA error: all CUDA-capable devices are busy or unavailable`

nkla · November 13, 2020, 8:11pm

Thanks ptrblck, granth_jain.

When I check dmesg for NVRM messages, I see:

# dmesg|grep "NVRM"
[    9.976755] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020

I noticed that when I run the torch calls that error out, the following appears in dmesg:

[43613.854296] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)

This was caused by the NVIDIA driver's incompatibility with kernel 5.9. I downgraded from 5.9 to 5.8 on both computers, and these errors are resolved.
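
For anyone who wants to check for the same symptom, here is a minimal sketch that scans kernel log text for NVIDIA modules failing to resolve a symbol. The function name `find_unresolved_symbols` and the regular expression are my own illustration, and the sample text is just the dmesg lines quoted above.

```python
import re

# Matches kernel log lines such as:
# [43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
UNKNOWN_SYMBOL = re.compile(
    r"^\[\s*[\d.]+\]\s+(?P<module>nvidia\w*): Unknown symbol (?P<symbol>\S+)"
)


def find_unresolved_symbols(dmesg_text):
    """Return (module, symbol) pairs for NVIDIA modules that failed to link."""
    hits = []
    for line in dmesg_text.splitlines():
        match = UNKNOWN_SYMBOL.match(line.strip())
        if match:
            hits.append((match.group("module"), match.group("symbol")))
    return hits


# Sample input: the dmesg lines quoted in this post.
sample = """\
[    9.976755] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020
[43613.854296] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
"""
print(find_unresolved_symbols(sample))
# prints [('nvidia_uvm', 'radix_tree_preloads')]
```

On a live system the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may need root depending on the kernel's dmesg restrictions.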

However, my Quadro RTX 5000 machine is now hitting a different error:

$ python3 -c 'import torch; torch.randn(1).to(0)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I verified with nvidia-smi that no process is using the GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:04:00.0 Off |                  Off |
| 33%   26C    P8     6W / 230W |      1MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and that the compute mode is Default, not exclusive:

$ nvidia-smi -a | grep Compute
    Compute Mode                          : Default
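
The two checks above (compute mode and running compute processes) can also be scripted. This is a rough sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_csv_rows` and `query` are hypothetical, and the query fields are the ones listed by `nvidia-smi --help-query-gpu` and `nvidia-smi --help-query-compute-apps`.

```python
import shutil
import subprocess


def parse_csv_rows(text):
    """Split nvidia-smi CSV output (no header) into lists of stripped fields."""
    return [
        [field.strip() for field in line.split(",")]
        for line in text.splitlines()
        if line.strip()
    ]


def query(args):
    """Run nvidia-smi with the given query arguments and parse the CSV rows."""
    result = subprocess.run(
        ["nvidia-smi", *args, "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_csv_rows(result.stdout)


if shutil.which("nvidia-smi"):
    try:
        print("compute mode:", query(["--query-gpu=index,name,compute_mode"]))
        print("compute apps:", query(["--query-compute-apps=pid,process_name,used_memory"]))
    except subprocess.CalledProcessError as exc:
        print("nvidia-smi failed:", exc.stderr.strip())
else:
    print("nvidia-smi not found on PATH")
```

An empty list for the compute apps query corresponds to "No running processes found" in the table above.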

This is running under a KVM hypervisor with VFIO passthrough, and I verified that the nvidia driver is attached to the GPU:

04:00.0 VGA compatible controller: NVIDIA Corporation TU104GL [Quadro RTX 5000] (rev a1)
        Subsystem: Dell TU104GL [Quadro RTX 5000]
        Kernel driver in use: nvidia
        Kernel modules: nvidia
05:00.0 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
        Subsystem: Dell TU104 HD Audio Controller
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
06:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
        Subsystem: Dell TU104 USB 3.1 Host Controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
07:00.0 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
        Subsystem: Dell TU104 USB Type-C UCSI Controller
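
The same driver binding can be read from sysfs instead of lspci. This is only a sketch: the helper `bound_driver` is hypothetical, and the `0000:` PCI domain prefix is an assumption, since the lspci output above does not show the domain. Assuming it runs inside the guest, the GPU function should report `nvidia`; on the host, a passed-through device would normally be bound to `vfio-pci`.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI function, or None."""
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device does not exist
    # The "driver" entry is a symlink to the driver's directory in sysfs.
    return os.path.basename(os.path.realpath(link))


# Addresses taken from the lspci output above; the 0000: domain is assumed.
for address in ("0000:04:00.0", "0000:05:00.0", "0000:06:00.0", "0000:07:00.0"):
    print(address, "->", bound_driver(address))
```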

I tried the following, without success:

- Removed all nvidia-* packages and reinstalled nvidia-driver (tried both 450.80 and 455.38) and nvidia-cuda-toolkit (11.0).
- Reinstalled PyTorch for CUDA 11.0 using miniconda and pip.

The server is headless and has no desktop environment installed, so there should be no graphics processes using the GPU.

Do you know what is causing this issue? Could it be caused by VFIO, even though everything seems to be in order? Thanks!

More tests
I downloaded and compiled the script to test CUDA functionality. Device enumeration works, but the memory query (printed as cMemGetInfo) fails with error code 201.

$ ./cuda_check
Found 1 device(s).
Device: 0
  Name: Quadro RTX 5000
  Compute Capability: 7.5
  Multiprocessors: 48
  CUDA Cores: 3072
  Concurrent threads: 49152
  GPU clock: 1815 MHz
  Memory clock: 7001 MHz
  cMemGetInfo failed with error code 201: invalid device context
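
In the CUDA driver API, result code 201 is CUDA_ERROR_INVALID_CONTEXT. One possible reading, which I have not confirmed against the script's source, is that context creation already failed and the memory query is only reporting a follow-on error. A minimal sketch to see which driver call fails first, using ctypes against `libcuda.so.1`; the helper names `describe` and `probe` are my own, and the table lists only a handful of CUresult values from cuda.h.

```python
import ctypes

# A few CUresult values from the CUDA driver API header (cuda.h).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    101: "CUDA_ERROR_INVALID_DEVICE",
    201: "CUDA_ERROR_INVALID_CONTEXT",
}


def describe(code):
    """Translate a CUresult integer into its symbolic name when known."""
    return CU_RESULT_NAMES.get(code, f"CUresult {code}")


def probe(device_ordinal=0):
    """Call the driver API step by step; return [(step, result), ...]."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        return [("load libcuda.so.1", str(exc))]

    device = ctypes.c_int()
    context = ctypes.c_void_p()
    free, total = ctypes.c_size_t(), ctypes.c_size_t()
    calls = [
        ("cuInit", lambda: cuda.cuInit(0)),
        ("cuDeviceGet", lambda: cuda.cuDeviceGet(ctypes.byref(device), device_ordinal)),
        ("cuCtxCreate", lambda: cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device)),
        ("cuMemGetInfo", lambda: cuda.cuMemGetInfo_v2(ctypes.byref(free), ctypes.byref(total))),
    ]

    steps = []
    for name, call in calls:
        code = call()
        steps.append((name, describe(code)))
        if code != 0:
            break  # later calls would only report follow-on errors
    if context.value:
        cuda.cuCtxDestroy_v2(context)
    return steps


for step, result in probe():
    print(f"{step}: {result}")
```

If the first failing step is cuCtxCreate rather than cuMemGetInfo, that would point at context creation on the passed-through device rather than at the memory query itself.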