Could you rerun your code via `CUDA_LAUNCH_BLOCKING=1 python script.py args` and post the stack trace here?
The error points to an invalid kernel launch configuration. Are you using any custom CUDA code in your application? If not, which PyTorch and CUDA versions are you using, and on which GPU?
After moving the model to the GPU I got the error “rnn: hx is not contiguous”, so I added `.contiguous()` to the 5th line of `forward()` to fix it. Could the CUDA error have something to do with that?
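For reference, here is a minimal sketch of what I mean (placeholder module, not my actual code): cuDNN expects a contiguous hidden state, so a non-contiguous `hx` triggers the error, and calling `.contiguous()` on it works around it.

```python
import torch

# Minimal sketch (placeholder module, not the actual code):
# cuDNN requires a contiguous hidden state, so a non-contiguous hx
# raises "rnn: hx is not contiguous" on the GPU.
rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 3, 4)

# expand() creates a non-contiguous view (stride 0 in the batch dim)
hx = torch.zeros(1, 1, 8).expand(1, 2, 8)
assert not hx.is_contiguous()

# .contiguous() copies the data into dense memory, avoiding the error
out, h_n = rnn(x, hx.contiguous())
```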
It could be related. Could you post an executable code snippet that reproduces the initial error using this module?
It should work if you set this environment variable before importing any other library that might initialize the CUDA context. Since it’s often not straightforward to do this properly in a Jupyter notebook, I usually recommend running the script in a terminal instead.
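For example, in a plain Python script the variable can also be set from Python itself, as long as this happens at the very top, before the first CUDA-related import (a minimal sketch):

```python
import os

# Set the variable before importing torch (or any other library that
# could initialize the CUDA context); later changes would have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # CUDA kernel launches after this import run synchronously
```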