torch.cuda.DeferredCudaCallError: A strange bug after building a CUDA extension

Hi! I have run into a strange bug involving PyTorch and CUDA.
I was training my models without issues on a machine with 2 NVIDIA GeForce RTX 3090 cards, inside a Python virtual environment created with conda. To bring an extra model that relies on a CUDA extension into my project, I compiled and installed the extension by running `python setup.py install` with the virtual environment's Python interpreter (the source I built is GitHub - MultiPath/DepthwiseConv2d: Efficient implementation of Depthwise Conv2d; the main code comes from https://github.com/MegEngine). My machine uses CUDA 10.2 by default, but I switched to CUDA 11.0 for the build by running `export CUDA_HOME=/usr/local/cuda-11.0` in my SSH session.
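(Side note: since my PyTorch wheel reports `1.13.1+cu117`, building the extension against the CUDA 11.0 toolkit may already be a version mismatch. A minimal sketch to double-check which toolkit the extension build will pick up, using only standard torch APIs:)

```python
# Sketch: compare the CUDA toolkit that torch.utils.cpp_extension will use
# (derived from CUDA_HOME) with the CUDA version this PyTorch build expects.
# Run it with the same interpreter that runs setup.py install.
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("PyTorch:", torch.__version__)
print("CUDA used to build PyTorch:", torch.version.cuda)
print("CUDA_HOME seen by cpp_extension:", CUDA_HOME)
```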
Right after installing this extension into my virtual environment, things started going wrong. When training models from PyCharm, I can no longer select the single card numbered 0 by setting `os.environ["CUDA_VISIBLE_DEVICES"] = '0'` at the top of the script; training always runs on the card numbered 1 instead, and the same thing happens when I set the value to '1'. If I try to train with Distributed Data Parallel (DDP) on both cards, it reports this:
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn(“Error handling mechanism for deadlock detection is uninitialized. Skipping check.”)
Traceback (most recent call last):
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 242, in _lazy_init
queued_call()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 125, in _check_capability
capability = get_device_capability(d)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 357, in get_device_capability
prop = get_device_properties(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 375, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at “…/aten/src/ATen/cuda/CUDAContext.cpp”:50, please report a bug to PyTorch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “xxx/xxx/myProjects/asteroid/egs/librimix/TDANetPlus2/train-birdclefmix.py”, line 148, in
main(arg_dic)
File “xxx/xxx/myProjects/asteroid/egs/librimix/TDANetPlus2/train-birdclefmix.py”, line 113, in main
ckpt_path=None
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 609, in fit
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py”, line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py”, line 88, in launch
return function(*args, **kwargs)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1048, in _run
self.strategy.setup_environment()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py”, line 153, in setup_environment
super().setup_environment()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py”, line 131, in setup_environment
self.accelerator.setup_device(self.root_device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/accelerators/cuda.py”, line 43, in setup_device
_check_cuda_matmul_precision(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/lightning_fabric/accelerators/cuda.py”, line 345, in _check_cuda_matmul_precision
major, _ = torch.cuda.get_device_capability(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 357, in get_device_capability
prop = get_device_properties(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 371, in get_device_properties
_lazy_init() # will define _get_device_properties
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 246, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at “…/aten/src/ATen/cuda/CUDAContext.cpp”:50, please report a bug to PyTorch.

CUDA call was originally invoked at:

[’ File “xxx/xxx/myProjects/asteroid/egs/librimix/TDANetPlus2/train-birdclefmix.py”, line 5, in \n import torch\n’, ’ File “”, line 983, in _find_and_load\n’, ’ File “”, line 967, in _find_and_load_unlocked\n’, ’ File “”, line 677, in _load_unlocked\n’, ’ File “”, line 728, in exec_module\n’, ’ File “”, line 219, in _call_with_frames_removed\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/init.py”, line 798, in \n _C._initExtension(manager_path())\n’, ’ File “”, line 983, in _find_and_load\n’, ’ File “”, line 967, in _find_and_load_unlocked\n’, ’ File “”, line 677, in _load_unlocked\n’, ’ File “”, line 728, in exec_module\n’, ’ File “”, line 219, in _call_with_frames_removed\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 179, in \n _lazy_call(_check_capability)\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 177, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n’]
Traceback (most recent call last):
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 242, in _lazy_init
queued_call()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 125, in _check_capability
capability = get_device_capability(d)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 357, in get_device_capability
prop = get_device_properties(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 375, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at “…/aten/src/ATen/cuda/CUDAContext.cpp”:50, please report a bug to PyTorch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “xxx/xxx/myProjects/asteroid/egs/librimix/TDANetPlus2/train-birdclefmix.py”, line 148, in
main(arg_dic)
File “xxx/xxx/myProjects/asteroid/egs/librimix/TDANetPlus2/train-birdclefmix.py”, line 113, in main
ckpt_path=None
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 609, in fit
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py”, line 38, in _call_and_handle_interrupt
return trainer_fn(args, kwargs)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1048, in _run
self.strategy.setup_environment()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py”, line 153, in setup_environment
super().setup_environment()
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py”, line 131, in setup_environment
self.accelerator.setup_device(self.root_device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/pytorch_lightning/accelerators/cuda.py”, line 43, in setup_device
_check_cuda_matmul_precision(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/lightning_fabric/accelerators/cuda.py”, line 345, in _check_cuda_matmul_precision
major, _ = torch.cuda.get_device_capability(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 357, in get_device_capability
prop = get_device_properties(device)
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 371, in get_device_properties
_lazy_init() # will define _get_device_properties
File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 246, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at “…/aten/src/ATen/cuda/CUDAContext.cpp”:50, please report a bug to PyTorch.
--- (The following C++ stack trace was printed after setting os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1")
Exception raised from getDeviceProperties at …/aten/src/ATen/cuda/CUDAContext.cpp:50 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f09df7d0457 in /xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f09df79a4b5 in /xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: at::cuda::getDeviceProperties(long) + 0x13f (0x7f0a27d41c3f in /xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0xbab182 (0x7f0a37e6f182 in /xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x3e5a3a (0x7f0a376a9a3a in /xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x237 (0x4aef97 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #6: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae9f0]
frame #7: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #8: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #9: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #10: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #11: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #12: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #13: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #14: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #15: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #16: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #17: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #18: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #19: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #20: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #21: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #22: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #23: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #26: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #27: _PyEval_EvalFrameDefault + 0x468a (0x4ab36a in /xxx/xxx/envs/AST2/bin/python3.7)
frame #28: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #29: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #30: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #31: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #32: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #33: _PyEval_EvalFrameDefault + 0xa9e (0x4a777e in /xxx/xxx/envs/AST2/bin/python3.7)
frame #34: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #35: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #36: _PyEval_EvalFrameDefault + 0x468a (0x4ab36a in /xxx/xxx/envs/AST2/bin/python3.7)
frame #37: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #38: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #39: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #40: _PyEval_EvalFrameDefault + 0xa9e (0x4a777e in /xxx/xxx/envs/AST2/bin/python3.7)
frame #41: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #42: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #43: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #44: _PyEval_EvalFrameDefault + 0x15d6 (0x4a82b6 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #45: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #46: _PyFunction_FastCallDict + 0x2d7 (0x4c0f57 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #47: /xxx/xxx/envs/AST2/bin/python3.7() [0x4c9a80]
frame #48: PyObject_Call + 0x60 (0x4c7170 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #49: _PyEval_EvalFrameDefault + 0x1ea4 (0x4a8b84 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #50: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #51: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #52: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #53: _PyEval_EvalFrameDefault + 0x468a (0x4ab36a in /xxx/xxx/envs/AST2/bin/python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #55: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /xxx/xxx/envs/AST2/bin/python3.7)
frame #56: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #57: _PyEval_EvalFrameDefault + 0x15d6 (0x4a82b6 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #58: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #59: /xxx/xxx/envs/AST2/bin/python3.7() [0x4ae8df]
frame #60: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #61: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #62: PyEval_EvalCodeEx + 0x39 (0x4a5879 in /xxx/xxx/envs/AST2/bin/python3.7)
frame #63: PyEval_EvalCode + 0x1b (0x54a8db in /xxx/xxx/envs/AST2/bin/python3.7)

CUDA call was originally invoked at:

[’ File “xxx/xxx/xxx/train.py”, line 5, in \n import torch\n’, ’ File “”, line 983, in _find_and_load\n’, ’ File “”, line 967, in _find_and_load_unlocked\n’, ’ File “”, line 677, in _load_unlocked\n’, ’ File “”, line 728, in exec_module\n’, ’ File “”, line 219, in _call_with_frames_removed\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/init.py”, line 798, in \n _C._initExtension(manager_path())\n’, ’ File “”, line 983, in _find_and_load\n’, ’ File “”, line 967, in _find_and_load_unlocked\n’, ’ File “”, line 677, in _load_unlocked\n’, ’ File “”, line 728, in exec_module\n’, ’ File “”, line 219, in _call_with_frames_removed\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 179, in \n _lazy_call(_check_capability)\n’, ’ File “xxx/xxx/envs/AST2/lib/python3.7/site-packages/torch/cuda/init.py”, line 177, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n’]

Process finished with exit code 1

```
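To isolate the failure from Lightning, the same lazy-initialization path can be exercised with a few lines of plain PyTorch (a sketch; the device list below is just illustrative):

```python
# Minimal repro sketch: the trace fails inside torch.cuda._lazy_init() when it
# queries device properties, so this touches the same path without Lightning.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must be set before torch initializes CUDA

import torch

print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # get_device_properties() triggers _lazy_init() and the capability check
    print(i, torch.cuda.get_device_properties(i))
```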

I tried creating a fresh virtual environment with a new Python interpreter to get around this bug, but it failed after several attempts. The same error persists after rebooting the machine a few times, and after switching `export CUDA_HOME=/usr/local/cuda-x.x` between 10.2 and 11.0 or setting `export CUDA_VISIBLE_DEVICES=0,1` in my ~/.bashrc file.
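(Note that a PyCharm run configuration does not necessarily source ~/.bashrc, so it is worth printing what the training process actually sees, for example:)

```python
# Sketch: confirm the value that actually reaches the training process;
# values exported in ~/.bashrc may not be visible to a PyCharm-launched run.
import os
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
```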
I also tried removing the CUDA extension by running `rm -r build` in the directory where I compiled it, but that doesn't help even though the build artifacts are gone.
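As far as I understand, `python setup.py install` also copies the built package into the environment's site-packages (typically an .egg directory plus an entry in easy-install.pth), so deleting `build/` alone does not uninstall it. Here is a sketch for locating the leftovers; the `*depthwise*` pattern is only a guess at the installed name and needs adjusting:

```python
# Sketch: list what setup.py install may have left behind in site-packages.
# The "*depthwise*" pattern is a guess at the extension's package name.
import glob
import site

for sp in site.getsitepackages():
    candidates = glob.glob(f"{sp}/*depthwise*") + glob.glob(f"{sp}/easy-install.pth")
    for path in candidates:
        print(path)
```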
One more note: after upgrading pytorch-lightning to 2.0.0 and torch to 2.x, I can use card 0 or 1 again, but I still cannot train with DDP, and this workaround does not suit my project because of package compatibility.
Do you have any idea how to fix this problem? Hoping for a reply and a fix, thanks!

Versions

Collecting environment information…
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.16 (default, Jan 17 2023, 22:20:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-56-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 10.2.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] numpydoc==1.5.0
[pip3] pytorch-ignite==0.3.0
[pip3] pytorch-lightning==1.9.5
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-transformers==1.2.0
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torch-stoi==0.1.2
[pip3] torchaudio==0.13.1
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.11.4
[conda] numpy 1.21.6 pypi_0 pypi
[conda] numpydoc 1.5.0 pypi_0 pypi
[conda] pytorch-lightning 1.9.5 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 1.13.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torch-stoi 0.1.2 pypi_0 pypi
[conda] torchaudio 0.13.1 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi

I would completely uninstall all CUDA 10.2 versions, since your 3090s need CUDA 11+. Once this is done you could also try to run any other CUDA application on both devices and see if that works, as I don't think the issue is specific to PyTorch.
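For example, something like this per-device smoke test (a sketch; it assumes nvidia-smi is on the PATH and uses only standard torch calls) would show whether both GPUs can run kernels at all:

```python
# Per-device smoke test sketch: list GPUs via nvidia-smi, then launch a small
# kernel on each visible device to confirm it is actually usable.
import subprocess
import torch

print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

for i in range(torch.cuda.device_count()):
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    print(f"cuda:{i}", torch.matmul(x, x).sum().item())
```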

Thanks for your advice. I managed to uninstall CUDA 10.2 completely, but when I ran the script again on both devices it failed with the same problem; it can still only run on the card numbered 1.

After uninstalling CUDA 10.2 (leaving CUDA 11.0 in place), I tried installing a newer CUDA 11.2 via the local .deb package. A reboot didn't help at first, but the next day the machine threw some new exceptions: some files (for example, ~/.bashrc) had become read-only, and creating a virtual environment in Anaconda failed for lack of space, which suggests the file system was damaged. I tried remounting and repairing it without success; however, after another reboot everything went back to normal.