Hello,
I'm working with Kubernetes, so I installed NVIDIA driver 470 and then the CUDA Toolkit 11.8 on my Kubernetes worker node.
I'm trying to run a Python script in my pod to test my GPUs, so first I installed the following libraries:
• pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
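To double-check that the cu118 build is actually the one in use, a quick sanity check like this should do (a minimal sketch; the version strings in the comments are what I'd expect from the install command above):

import torch

print(torch.__version__)          # expected: 2.0.1+cu118
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # True when the driver is visible in the pod
print(torch.cuda.device_count())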
This is the code I'm trying to run:
import torch
import time

def test_gpu():
    # Check if GPU is available
    if not torch.cuda.is_available():
        print("No GPU detected. Please ensure PyTorch is installed with GPU support and a compatible GPU is available.")
        return

    print("GPU detected! Details:")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Capability: {torch.cuda.get_device_capability(0)}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

    # Perform a simple computation to stress test the GPU
    print("\nStarting GPU stress test...")
    matrix_size = 50
    print(f"Matrix size: {matrix_size} x {matrix_size}")

    # Create random matrices on GPU
    device = torch.device("cuda")
    a = torch.rand((matrix_size, matrix_size), device=device)
    b = torch.rand((matrix_size, matrix_size), device=device)

    # Perform matrix multiplication
    start_time = time.time()
    result = torch.mm(a, b)
    torch.cuda.synchronize()  # Ensure computation is finished
    end_time = time.time()

    print("GPU stress test completed.")
    print(f"Time taken for matrix multiplication: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    test_gpu()
###################
This is the result:
###################
(base) jovyan@testgpu2222222-0:~$ python /home/jovyan/test2.py
GPU detected! Details:
GPU Name: NVIDIA A10-6C
CUDA Capability: (8, 6)
Total Memory: 6.27 GB
Starting GPU stress test…
Matrix size: 50 x 50
File “/home/jovyan/test2.py”, line 22, in test_gpu
a = torch.rand((matrix_size, matrix_size), device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
As you can see, the pod can detect the GPU, but while running the Python code inside the pod we get this error:
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
This happens even though no other process is running and the GPU is in the default compute mode.
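For reference, the failure can be narrowed down without the rest of the script. Here is a minimal sketch; my assumption, following the hint in the error message itself, is that setting CUDA_LAUNCH_BLOCKING before the first CUDA call makes the reported line reliable:

import os
# Make CUDA errors synchronous so the reported line is the real
# failure point; must be set before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print(torch.cuda.is_available())   # True in the pod, matching the output above
print(torch.cuda.device_count())

# Even a single tiny allocation hits the same failure point as
# torch.rand() in the script above:
x = torch.zeros(1, device="cuda")
print(x)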