RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

Abdellah_Bousouf · December 28, 2024, 3:57pm

Hello,

I'm working with #kubernetes. I installed NVIDIA driver 470 and then CUDA toolkit 11.8 on my Kubernetes worker node.

I'm trying to run Python code in my pod to test my GPUs, so I first installed the following libraries:

• pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

This is the code that I'm trying to run:

import torch
import time

def test_gpu():
    # Check if GPU is available
    if not torch.cuda.is_available():
        print("No GPU detected. Please ensure PyTorch is installed with GPU support and a compatible GPU is available.")
        return

    print("GPU detected! Details:")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Capability: {torch.cuda.get_device_capability(0)}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

    # Perform a simple computation to stress test the GPU
    print("\nStarting GPU stress test...")
    matrix_size = 50
    print(f"Matrix size: {matrix_size} x {matrix_size}")

    # Create random matrices on GPU
    device = torch.device("cuda")
    a = torch.rand((matrix_size, matrix_size), device=device)
    b = torch.rand((matrix_size, matrix_size), device=device)

    # Perform matrix multiplication
    start_time = time.time()
    result = torch.mm(a, b)
    torch.cuda.synchronize()  # Ensure computation is finished
    end_time = time.time()

    print("GPU stress test completed.")
    print(f"Time taken for matrix multiplication: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    test_gpu()

###################
this is the result:
###################

(base) jovyan@testgpu2222222-0:~$ python /home/jovyan/test2.py
GPU detected! Details:
GPU Name: NVIDIA A10-6C
CUDA Capability: (8, 6)
Total Memory: 6.27 GB
Starting GPU stress test...
Matrix size: 50 x 50

  File "/home/jovyan/test2.py", line 22, in test_gpu
    a = torch.rand((matrix_size, matrix_size), device=device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
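
The message above suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of applying it from inside the script follows, assuming the variable needs to be in place before PyTorch makes its first CUDA call. The helper name `first_cuda_allocation` is hypothetical and not part of the original test script.

```python
import os

# Set this before the first CUDA call; doing it before importing torch
# is the conservative choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def first_cuda_allocation():
    """Hypothetical helper: trigger the first real CUDA allocation."""
    import torch  # imported only after the environment variable is set

    # torch.cuda.is_available() can return True without creating a CUDA
    # context; an allocation like this one has to create it.
    return torch.rand((2, 2), device="cuda")
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python /home/jovyan/test2.py`. This should make the reported stack trace more reliable, but it is not expected to remove the error itself.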

As you can see, the pod can detect the GPU, but running the Python code inside the pod fails with this error:

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

This happens even though no process is running on the GPU and the GPU is in the default mode.
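
One way to double-check the "no process, default mode" observation from inside the pod is to ask nvidia-smi for the driver version, compute mode, and memory use. The sketch below assumes `nvidia-smi` is on the PATH in the container; the names `FIELDS`, `parse_gpu_csv`, and `query_gpus` are made up for this example.

```python
import shutil
import subprocess

# Fields requested from `nvidia-smi --query-gpu=...`
FIELDS = ["name", "driver_version", "compute_mode", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU line."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(FIELDS):  # skip lines that do not match the query
            rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi if present; return [] when it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

A compute mode other than `Default` would be one possible explanation for a "busy or unavailable" error. Note that nvidia-smi run inside a container may not list processes that belong to other pods or to the host, so an empty process list there is not conclusive.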

johann-petrak · April 17, 2025, 7:39am

I keep getting this intermittently too, and there is no indication of what could be going on, as the GPU is barely used: only 1 of 4 GB is in use according to nvidia-smi.
This is extremely annoying. I have found quite a few reports on the internet from people with the same problem, but no real reproducible solution or even an explanation of what is going on.
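
For the intermittent case, a small wrapper that logs each failure and retries may at least help narrow down when the error occurs. This is only a sketch: `call_with_retry` is a hypothetical helper, it matches on the error text quoted in this thread, and whether a retry in the same process can succeed after this error is an assumption that has not been verified here.

```python
import time


def call_with_retry(fn, attempts=5, base_delay=1.0, sleep=time.sleep, log=print):
    """Call fn(); retry only on the 'busy or unavailable' RuntimeError."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError as err:
            busy = "busy or unavailable" in str(err)
            log(f"attempt {attempt}/{attempts} failed: {err}")
            # Re-raise unrelated errors immediately, and the last failure too.
            if not busy or attempt == attempts:
                raise
            sleep(base_delay * attempt)  # linear backoff between attempts


# Example use (requires PyTorch and a GPU):
#   import torch
#   x = call_with_retry(lambda: torch.zeros(1, device="cuda"))
```

If retries never succeed within one process, the logged attempts still record how often the failure happens, which is useful to include in a bug report.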