Do you see the same error if you run the code on the CPU?
This might yield a clearer error message than the current CUDA one.
If it’s working fine on the CPU, could you rerun the code using
`CUDA_LAUNCH_BLOCKING=1 python script.py args`
and post the stack trace again?
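Some background on why this helps: CUDA kernels are launched asynchronously, so the stack trace can point to a line that runs after the one that actually failed. With `CUDA_LAUNCH_BLOCKING=1`, each launch is synchronous and the error should show up at the failing call. If setting the variable in your shell is inconvenient, here is a minimal sketch of a launcher that does the same thing from Python. The helper name `run_with_blocking_launches` is made up for illustration, and `script.py` / `args` stand in for your own script and arguments.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(script, *args):
    """Run `script` in a child Python process with CUDA_LAUNCH_BLOCKING=1 set.

    Hypothetical helper, not part of PyTorch.
    """
    # Copy the current environment and add the variable, so it is already
    # set when the child process starts.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(
        [sys.executable, script, *args],
        env=env,
        capture_output=True,  # collect stdout and stderr of the child
        text=True,            # decode the captured output as str
    )
```

Calling `result = run_with_blocking_launches("script.py", "args")` and then printing `result.stderr` should give you the traceback to post here.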
PS: You can post code directly by wrapping it in three backticks ```