Torch.tensor(): CUDA-capable device(s) is/are busy or unavailable

Our torch.tensor() calls are failing with the error below:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
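As the error hint says, `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous so the stacktrace points at the real failing call. It must be set before the first CUDA operation, so the simplest approach is to set it in the shell (`CUDA_LAUNCH_BLOCKING=1 python your_script.py`, where `your_script.py` stands in for your code). A sketch of doing the same from a launcher process:

```python
import os
import subprocess
import sys

# CUDA_LAUNCH_BLOCKING must be in the environment before the first CUDA
# call, so set it in the environment of the launched process itself.
# The child command here is a placeholder that just confirms the variable
# is visible to a freshly started interpreter; substitute your script.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # → 1
```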

Our driver setup / environment printout is below:
NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2
__Python VERSION: 3.10.11 (main, Nov 30 2023, 18:20:49) [GCC 7.5.0]
__pyTorch VERSION: 2.1.0+cu121
__CUDA VERSION 12.1
__CUDNN VERSION: 8902
__Is CUDA available: True
__Number CUDA Devices: 1
Active CUDA Device: GPU 0
Available devices 1
Current cuda device 0
GPU count: 1

Trying to understand if there’s a weird version incompatibility causing this issue. In our container image we specify CUDA 12.3 and Torch 2.2.1, so it looks like an override is happening somewhere.
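One quick way to see which CUDA runtime a given wheel actually bundles is to inspect the local version suffix of `torch.__version__` (e.g. `2.1.0+cu121` in the printout above). The helper below is hypothetical, not part of PyTorch, and just decodes that suffix:

```python
# torch.__version__ for a CUDA wheel looks like "2.1.0+cu121": the
# "cu121" suffix means the wheel bundles the CUDA 12.1 runtime.
# wheel_cuda() is an illustrative helper, not a PyTorch API.
def wheel_cuda(version):
    _, _, local = version.partition("+")
    if local.startswith("cu") and local[2:].isdigit():
        digits = local[2:]
        return f"{digits[:-1]}.{digits[-1]}"  # "cu121" -> "12.1"
    return None  # CPU-only wheel or no suffix

print(wheel_cuda("2.1.0+cu121"))  # → 12.1  (the version in the printout above)
print(wheel_cuda("2.2.1"))        # → None  (no bundled CUDA runtime)
```

Comparing this against the version you pinned in the image is a fast way to confirm whether a different wheel was pulled in.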

The PyTorch binaries ship with their own CUDA runtime dependencies, and your locally installed CUDA toolkit will only be used if you build PyTorch from source or build a custom CUDA extension.
Make sure your container can run any CUDA application at all, as the error suggests the setup itself might be at fault.
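The driver/runtime relationship described above can be sketched numerically: the wheel's bundled runtime (12.1 for a `+cu121` build) only requires a driver whose reported CUDA version (12.2 here, from `nvidia-smi`) is at least as new. This check is illustrative, assuming simple version ordering rather than NVIDIA's full compatibility matrix:

```python
# Illustrative check, not an NVIDIA API: a driver whose reported CUDA
# version (12.2 from the nvidia-smi output above) is >= the runtime the
# wheel bundles (12.1 for a +cu121 build) can run that wheel without any
# local CUDA toolkit installed.
def driver_supports(driver_cuda, runtime_cuda):
    to_tuple = lambda v: tuple(map(int, v.split(".")))
    return to_tuple(driver_cuda) >= to_tuple(runtime_cuda)

print(driver_supports("12.2", "12.1"))  # → True: the reported setup is version-compatible
print(driver_supports("11.8", "12.1"))  # → False: driver too old for a cu121 wheel
```

Since the versions reported here are compatible, the "busy or unavailable" error more likely points at the container's GPU access (e.g. device visibility) than at a PyTorch/CUDA version mismatch.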