Our torch.tensor() calls are failing with the error below:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
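As a first debugging step we plan to re-run with synchronous kernel launches, per the error message's suggestion, so the traceback points at the real failing call. A minimal sketch of the repro (the tensor contents are a placeholder, not our actual workload):

```python
import os

# Must be set before torch initializes CUDA, hence before the import.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Stand-in for one of our failing torch.tensor() calls; with blocking
# launches the traceback should point at the actual offending line.
x = torch.tensor([1.0, 2.0, 3.0], device="cuda")
print(x)
```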
Our driver and environment printout is below:
NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2
__Python VERSION: 3.10.11 (main, Nov 30 2023, 18:20:49) [GCC 7.5.0]
__pyTorch VERSION: 2.1.0+cu121
__CUDA VERSION 12.1
__CUDNN VERSION: 8902
__Is CUDA available: True
__Number CUDA Devices: 1
Active CUDA Device: GPU 0
Available devices 1
Current cuda device 0
GPU count: 1
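For reference, the printout above was produced by roughly the following snippet (a sketch reconstructed from the output labels):

```python
import sys
import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION', torch.version.cuda)
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Is CUDA available:', torch.cuda.is_available())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('Active CUDA Device: GPU', torch.cuda.current_device())
print('Available devices', torch.cuda.device_count())
print('Current cuda device', torch.cuda.current_device())
print('GPU count:', torch.cuda.device_count())
```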
Trying to understand if there's a version compatibility problem causing this. In our container image we specify CUDA 12.3 and Torch 2.2.1, but the printout above shows Torch 2.1.0+cu121, so it looks like an override is happening somewhere in the build.
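To track down where the override comes from, we're checking which torch wheel actually resolves inside the container, along the lines of this sketch (importlib.metadata is stdlib; the paths printed will be container-specific):

```python
import importlib.metadata

import torch

# Version from installed package metadata vs. the module that actually imports.
print('metadata version:', importlib.metadata.version('torch'))
print('module version:  ', torch.__version__)   # 2.1.0+cu121 in our printout
print('built for CUDA:  ', torch.version.cuda)  # CUDA toolkit torch was built against
print('imported from:   ', torch.__file__)      # which site-packages won the import
```

If the metadata and module versions disagree, or torch.__file__ points at an unexpected site-packages directory, something later in the image build (a base image layer or a second pip install) is likely replacing the pinned wheel.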