RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

Hello,

I'm working with Kubernetes. On my Kubernetes worker node I installed NVIDIA driver 470 and then CUDA Toolkit 11.8.

I'm trying to run a Python script in my pod to test my GPUs, so first I installed the following libraries:

• pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
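
To confirm which build pip actually installed, a small check like the one below may help. It is only a sketch: `describe_torch_build` is a hypothetical helper name, and it reads nothing beyond what PyTorch itself exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`). With the cu118 wheel above, the reported CUDA version should be 11.8.

```python
import importlib.util


def describe_torch_build():
    """Return basic facts about the installed PyTorch build (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds
        "built_for_cuda": torch.version.cuda,
        # True only if a usable driver and at least one device are visible
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```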

This is the code that I'm trying to run:

import torch
import time


def test_gpu():
    # Check if GPU is available
    if not torch.cuda.is_available():
        print("No GPU detected. Please ensure PyTorch is installed with GPU support and a compatible GPU is available.")
        return

    print("GPU detected! Details:")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Capability: {torch.cuda.get_device_capability(0)}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

    # Perform a simple computation to stress test the GPU
    print("\nStarting GPU stress test...")
    matrix_size = 50
    print(f"Matrix size: {matrix_size} x {matrix_size}")

    # Create random matrices on GPU
    device = torch.device("cuda")
    a = torch.rand((matrix_size, matrix_size), device=device)
    b = torch.rand((matrix_size, matrix_size), device=device)

    # Perform matrix multiplication
    start_time = time.time()
    result = torch.mm(a, b)
    torch.cuda.synchronize()  # Ensure computation is finished
    end_time = time.time()

    print("GPU stress test completed.")
    print(f"Time taken for matrix multiplication: {end_time - start_time:.2f} seconds")


if __name__ == "__main__":
    test_gpu()

This is the result:

(base) jovyan@testgpu2222222-0:~$ python /home/jovyan/test2.py
GPU detected! Details:
GPU Name: NVIDIA A10-6C
CUDA Capability: (8, 6)
Total Memory: 6.27 GB
Starting GPU stress test...
Matrix size: 50 x 50

  File "/home/jovyan/test2.py", line 22, in test_gpu
    a = torch.rand((matrix_size, matrix_size), device=device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
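
The error text suggests `CUDA_LAUNCH_BLOCKING=1`. A smaller reproduction with that variable set might look like the sketch below. It assumes the variable has to be in the environment before CUDA is initialized, so it is set before `torch` is imported; `try_cuda_alloc` is a hypothetical helper name, not something from PyTorch.

```python
import os

# Set before CUDA is initialized so kernel launches run synchronously
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def try_cuda_alloc(index=0):
    """Try one tiny allocation on the given GPU; return (ok, message)."""
    try:
        import torch
    except ImportError:
        return False, "torch is not installed"

    if not torch.cuda.is_available():
        return False, "torch.cuda.is_available() returned False"

    try:
        # First real use of the device: this is where a CUDA context is created
        x = torch.zeros(1, device=f"cuda:{index}")
        torch.cuda.synchronize(index)
        return True, f"allocated {tuple(x.shape)} on {x.device}"
    except RuntimeError as err:
        return False, f"RuntimeError: {err}"


if __name__ == "__main__":
    ok, message = try_cuda_alloc()
    print("OK" if ok else "FAILED", "-", message)
```

If even this single-element allocation fails with the same message, the matrix sizes in the test script are unlikely to be the cause.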

As you can see, the pod can detect the GPU, but when running the Python code inside the pod we get this error:

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

This happens even though no process is running on the GPU and the GPU is in the default compute mode.
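
For reference, the compute mode and memory figures that `nvidia-smi` reports inside the pod can be captured with something like the sketch below. The query fields (`name`, `driver_version`, `compute_mode`, `memory.used`, `memory.total`) are standard `nvidia-smi --query-gpu` fields; `query_gpus` and `parse_gpu_line` are hypothetical helper names, and the function returns None when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess

FIELDS = ["name", "driver_version", "compute_mode", "memory.used", "memory.total"]


def parse_gpu_line(line):
    """Turn one CSV row from nvidia-smi into a dict keyed by FIELDS."""
    values = [part.strip() for part in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpus():
    """Return one dict per GPU, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None
    return [parse_gpu_line(row) for row in out.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is not available or returned an error")
    else:
        for gpu in gpus:
            print(gpu)
```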

I keep getting this intermittently, and there is no indication of what could be going on, since the GPU is barely used: only 1 GB of 4 GB is in use according to nvidia-smi.
This is extremely annoying. I have found quite a few reports on the internet from people with the same problem, but no reproducible solution or even an explanation of what is going on.