PyTorch version incompatible with CUDA

I got this error when running PyTorch code on two GPUs:
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

torch version: ‘1.11.0+cu102’
cuda version: 11.4 driver: 470.63.01

Could you please tell me whether this error is caused by an incompatibility between the CUDA and torch versions, and how I can fix it?
Thanks!

Could you rerun your script with NCCL_DEBUG=INFO and post the log here, please?


Here is the log:
trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1435, in _call_setup_hook
self.training_type_plugin.barrier(“pre_setup”)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py”, line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

This output doesn’t show any NCCL logs but just the PyTorch stacktrace, so make sure to export this env variable before rerunning your script.
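One way to make sure the variable is actually set for every rank is to set it inside the entry script itself, before torch is imported. A minimal sketch, assuming the training script is the process entry point; `enable_nccl_debug` is a hypothetical helper name, not a PyTorch API:

```python
import os


def enable_nccl_debug(level: str = "INFO") -> str:
    """Set NCCL_DEBUG unless it was already exported, and return the active value."""
    # setdefault keeps a value exported in the shell and only fills in a missing one.
    return os.environ.setdefault("NCCL_DEBUG", level)


# Call this before importing torch or creating the Trainer, so that any
# worker processes started later inherit the variable.
enable_nccl_debug()
```

Exporting the variable in the shell (`export NCCL_DEBUG=INFO`) before launching the script has the same effect.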

I think these are the NCCL logs:
c309-002:278034:278034 [0] enqueue.cc:102 NCCL WARN Cuda failure ‘invalid device function’
c309-002:278034:278034 [0] NCCL INFO Bootstrap : Using ib0:192.168.41.194<0>
c309-002:278034:278034 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
c309-002:278034:278034 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.41.194<0>
c309-002:278034:278034 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2

c309-002:278123:278123 [1] enqueue.cc:102 NCCL WARN Cuda failure ‘invalid device function’
c309-002:278123:278123 [1] NCCL INFO Bootstrap : Using ib0:192.168.41.194<0>
c309-002:278123:278123 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
c309-002:278123:278123 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.41.194<0>
c309-002:278123:278123 [1] NCCL INFO Using network IB
c309-002:278034:278243 [0] NCCL INFO Channel 00/02 : 0 1
c309-002:278123:278245 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
c309-002:278034:278243 [0] NCCL INFO Channel 01/02 : 0 1
c309-002:278123:278245 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000
c309-002:278034:278243 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
c309-002:278034:278243 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
c309-002:278123:278245 [1] NCCL INFO Channel 00 : 1[81000] → 0[21000] via P2P/IPC
c309-002:278123:278245 [1] NCCL INFO Channel 01 : 1[81000] → 0[21000] via P2P/IPC
c309-002:278034:278243 [0] NCCL INFO Channel 00 : 0[21000] → 1[81000] via P2P/IPC
c309-002:278034:278243 [0] NCCL INFO Channel 01 : 0[21000] → 1[81000] via P2P/IPC
c309-002:278123:278245 [1] NCCL INFO Connected all rings
c309-002:278123:278245 [1] NCCL INFO Connected all trees
c309-002:278034:278243 [0] NCCL INFO Connected all rings
c309-002:278034:278243 [0] NCCL INFO Connected all trees
c309-002:278123:278245 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c309-002:278123:278245 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c309-002:278034:278243 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c309-002:278034:278243 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c309-002:278123:278245 [1] NCCL INFO comm 0x154ba4001240 rank 1 nranks 2 cudaDev 1 busId 81000 - Init COMPLETE
c309-002:278034:278243 [0] NCCL INFO comm 0x14c2d0001240 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
c309-002:278034:278034 [0] NCCL INFO Launch mode Parallel

c309-002:278123:278123 [1] enqueue.cc:300 NCCL WARN Cuda failure ‘invalid device function’
c309-002:278123:278123 [1] NCCL INFO group.cc:347 → 1

c309-002:278034:278034 [0] enqueue.cc:300 NCCL WARN Cuda failure ‘invalid device function’
c309-002:278034:278034 [0] NCCL INFO group.cc:347 → 1

Yes, these are the NCCL logs, but they don’t show any earlier error besides:

enqueue.cc:102 NCCL WARN Cuda failure ‘invalid device function’

Try rerunning the code with these additional debugging variables set:

TORCH_DISTRIBUTED_DEBUG=INFO
TORCH_SHOW_CPP_STACKTRACES=1
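The variables can also be applied from a small launcher, so that they are guaranteed to reach the training process and all output lands in one file. A minimal sketch under stated assumptions: `run_with_debug_env` and the log file name are hypothetical, and `train.py` stands in for the actual entry script:

```python
import os
import subprocess
import sys

# Debug switches mentioned in this thread.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
    "TORCH_SHOW_CPP_STACKTRACES": "1",
}


def run_with_debug_env(cmd, log_path):
    """Run cmd with the debug variables set; write stdout and stderr to log_path."""
    env = {**os.environ, **DEBUG_ENV}
    with open(log_path, "w") as log:
        # stderr is merged into stdout so the NCCL lines and the Python/C++
        # stack traces end up in the same file.
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Example call (placeholder script name):
# run_with_debug_env([sys.executable, "train.py"], "debug.log")
```

Keeping both streams in one file makes it easier to post the NCCL output and the stack trace together.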

trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1435, in _call_setup_hook
self.training_type_plugin.barrier(“pre_setup”)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py”, line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x14d6935a77d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x14d6935a3e6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: + 0x1145f2a (0x14d694bfff2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0x115106d (0x14d694c0b06d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x14d694c0c09f in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x14d694c11ea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x14d694c1b7ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x8301e5 (0x14d6e6e131e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x1f6aa1 (0x14d6e67d9aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: _PyMethodDef_RawFastCallKeywords + 0x2ec (0x556d69f1d83c in /home1/anaconda3/envs/my_env/bin/python)
frame #10: _PyObject_FastCallKeywords + 0x130 (0x556d69f53140 in /home1/anaconda3/envs/my_env/bin/python)
frame #11: + 0x17fbd1 (0x556d69f53bd1 in /home1/anaconda3/envs/my_env/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x1401 (0x556d69f983a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #14: _PyFunction_FastCallKeywords + 0x583 (0x556d69f0ccd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #15: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x1401 (0x556d69f983a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #17: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0x583 (0x556d69f0ccd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #19: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x661 (0x556d69f97601 in /home1/anaconda3/envs/my_env/bin/python)
frame #21: _PyFunction_FastCallKeywords + 0x187 (0x556d69f0c8d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #22: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x661 (0x556d69f97601 in /home1/anaconda3/envs/my_env/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #25: _PyFunction_FastCallKeywords + 0x583 (0x556d69f0ccd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #26: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1401 (0x556d69f983a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #29: _PyObject_FastCallDict + 0x312 (0x556d69eee592 in /home1/anaconda3/envs/my_env/bin/python)
frame #30: + 0x12f1c3 (0x556d69f031c3 in /home1/anaconda3/envs/my_env/bin/python)
frame #31: PyObject_Call + 0xb4 (0x556d69eeeb94 in /home1/anaconda3/envs/my_env/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x1cb8 (0x556d69f98c58 in /home1/anaconda3/envs/my_env/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #34: _PyFunction_FastCallKeywords + 0x583 (0x556d69f0ccd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #35: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x661 (0x556d69f97601 in /home1/anaconda3/envs/my_env/bin/python)
frame #37: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #38: _PyFunction_FastCallKeywords + 0x521 (0x556d69f0cc71 in /home1/anaconda3/envs/my_env/bin/python)
frame #39: + 0x17f9c5 (0x556d69f539c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x661 (0x556d69f97601 in /home1/anaconda3/envs/my_env/bin/python)
frame #41: _PyEval_EvalCodeWithName + 0xdf9 (0x556d69eeda29 in /home1/anaconda3/envs/my_env/bin/python)
frame #42: _PyFunction_FastCallKeywords + 0x583 (0x556d69f0ccd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x3f5 (0x556d69f97395 in /home1/anaconda3/envs/my_env/bin/python)
frame #44: _PyFunction_FastCallKeywords + 0x187 (0x556d69f0c8d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x3f5 (0x556d69f97395 in /home1/anaconda3/envs/my_env/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x255 (0x556d69eece85 in /home1/anaconda3/envs/my_env/bin/python)
frame #47: PyEval_EvalCode + 0x23 (0x556d69eee273 in /home1/anaconda3/envs/my_env/bin/python)
frame #48: + 0x227c82 (0x556d69ffbc82 in /home1/anaconda3/envs/my_env/bin/python)
frame #49: PyRun_FileExFlags + 0x9e (0x556d6a005e1e in /home1/anaconda3/envs/my_env/bin/python)
frame #50: PyRun_SimpleFileExFlags + 0x1bb (0x556d6a00600b in /home1/anaconda3/envs/my_env/bin/python)
frame #51: + 0x2330fa (0x556d6a0070fa in /home1/anaconda3/envs/my_env/bin/python)
frame #52: _Py_UnixMain + 0x3c (0x556d6a00718c in /home1/anaconda3/envs/my_env/bin/python)
frame #53: __libc_start_main + 0xf3 (0x14d6f61384a3 in /usr/lib64/libc.so.6)
frame #54: + 0x1d803a (0x556d69fac03a in /home1/anaconda3/envs/my_env/bin/python)

trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())

File “/home1/anaconda3/envs/cmy_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1435, in _call_setup_hook
self.training_type_plugin.barrier(“pre_setup”)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py”, line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1552110d17d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x1552110cde6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: + 0x1145f2a (0x155212729f2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0x115106d (0x15521273506d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x15521273609f in /home1//anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x15521273bea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x1552127457ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x8301e5 (0x15526493d1e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x1f6aa1 (0x155264303aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #53: __libc_start_main + 0xf3 (0x155273c624a3 in /usr/lib64/libc.so.6)

Unfortunately, the NCCL logs are gone again. It might be easier if you could post a minimal, executable code snippet that reproduces the issue, as well as the output of python -m torch.utils.collect_env.
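For a suspected GPU/binary mismatch, the most relevant fields of that report can also be queried directly from Python. A minimal sketch, assuming a PyTorch version that provides `torch.cuda.get_arch_list()`; `describe_cuda_build` is a hypothetical helper name, and it returns None when torch is not installed:

```python
def describe_cuda_build():
    """Collect the CUDA build info of the installed torch and the visible GPUs."""
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment

    info = {
        "torch": torch.__version__,       # e.g. a version string with a +cuXYZ suffix
        "cuda_build": torch.version.cuda,  # CUDA version the binary was built with
        "arch_list": [],                   # GPU architectures compiled into the binary
        "devices": [],                     # (name, (major, minor)) per visible GPU
    }
    if torch.cuda.is_available():
        info["arch_list"] = torch.cuda.get_arch_list()
        for idx in range(torch.cuda.device_count()):
            info["devices"].append(
                (torch.cuda.get_device_name(idx), torch.cuda.get_device_capability(idx))
            )
    return info


print(describe_cuda_build())
```

Comparing `arch_list` with each device's capability shows whether the installed binary was compiled for the GPUs in the machine.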

Collecting environment information…
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Rocky Linux release 8.4 (Green Obsidian) (x86_64)
GCC version: (GCC) 9.4.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.12.1.el8_4.x86_64-x86_64-with-redhat-8.4-Green_Obsidian
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB

Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.11.0
[pip3] torchmetrics==0.9.1
[conda] numpy 1.21.6 pypi_0 pypi
[conda] pytorch-lightning 1.5.10 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchmetrics 0.9.1 pypi_0 pypi

c305-001:221921:221921 [0] enqueue.cc:102 NCCL WARN Cuda failure ‘invalid device function’
c305-001:221921:221921 [0] NCCL INFO Bootstrap : Using ib0:192.168.41.97<0>
c305-001:221921:221921 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
c305-001:221921:221921 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.41.97<0>
c305-001:221921:221921 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2

c305-001:222031:222031 [1] enqueue.cc:102 NCCL WARN Cuda failure ‘invalid device function’
c305-001:222031:222031 [1] NCCL INFO Bootstrap : Using ib0:192.168.41.97<0>
c305-001:222031:222031 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
c305-001:222031:222031 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.41.97<0>
c305-001:222031:222031 [1] NCCL INFO Using network IB
c305-001:222031:222058 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
c305-001:221921:222056 [0] NCCL INFO Channel 00/02 : 0 1
c305-001:222031:222058 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000
c305-001:221921:222056 [0] NCCL INFO Channel 01/02 : 0 1
c305-001:221921:222056 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
c305-001:221921:222056 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
c305-001:222031:222058 [1] NCCL INFO Channel 00 : 1[81000] → 0[21000] via P2P/IPC
c305-001:222031:222058 [1] NCCL INFO Channel 01 : 1[81000] → 0[21000] via P2P/IPC
c305-001:221921:222056 [0] NCCL INFO Channel 00 : 0[21000] → 1[81000] via P2P/IPC
c305-001:221921:222056 [0] NCCL INFO Channel 01 : 0[21000] → 1[81000] via P2P/IPC
c305-001:222031:222058 [1] NCCL INFO Connected all rings
c305-001:222031:222058 [1] NCCL INFO Connected all trees
c305-001:221921:222056 [0] NCCL INFO Connected all rings
c305-001:221921:222056 [0] NCCL INFO Connected all trees
c305-001:222031:222058 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c305-001:222031:222058 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c305-001:221921:222056 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c305-001:221921:222056 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c305-001:222031:222058 [1] NCCL INFO comm 0x14646c001240 rank 1 nranks 2 cudaDev 1 busId 81000 - Init COMPLETE
c305-001:221921:222056 [0] NCCL INFO comm 0x1546b8001240 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
c305-001:221921:221921 [0] NCCL INFO Launch mode Parallel

c305-001:221921:221921 [0] enqueue.cc:300 NCCL WARN Cuda failure ‘invalid device function’
c305-001:221921:221921 [0] NCCL INFO group.cc:347 → 1

c305-001:222031:222031 [1] enqueue.cc:300 NCCL WARN Cuda failure ‘invalid device function’
c305-001:222031:222031 [1] NCCL INFO group.cc:347 → 1

initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO: Added key: store_based_barrier_key:1 to store for rank: 1
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py:510: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn(“Error handling mechanism for deadlock detection is uninitialized. Skipping check.”)
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File “~/novo/novo.py”, line 63, in <module>
main()
File “~/novo/novo.py”, line 47, in main
train(train_data_path, val_data_path, model_path, config_path)
File “/work/08447/se0204/Transformer/main/novo/denovo/train_test.py”, line 144, in train
trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1435, in _call_setup_hook
self.training_type_plugin.barrier(“pre_setup”)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py”, line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1464f1e5e7d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x1464f1e5ae6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: + 0x1145f2a (0x1464f34b6f2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0x115106d (0x1464f34c206d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x1464f34c309f in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x1464f34c8ea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x1464f34d27ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x8301e5 (0x1465456ca1e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x1f6aa1 (0x146545090aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: _PyMethodDef_RawFastCallKeywords + 0x2ec (0x55ca311e283c in /home1/anaconda3/envs/my_env/bin/python)
frame #10: _PyObject_FastCallKeywords + 0x130 (0x55ca31218140 in /home1/anaconda3/envs/my_env/bin/python)
frame #11: + 0x17fbd1 (0x55ca31218bd1 in /home1/anaconda3/envs/my_env/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #14: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #15: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #17: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #19: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #21: _PyFunction_FastCallKeywords + 0x187 (0x55ca311d18d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #22: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #25: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #26: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #29: _PyObject_FastCallDict + 0x312 (0x55ca311b3592 in /home1/anaconda3/envs/my_env/bin/python)
frame #30: + 0x12f1c3 (0x55ca311c81c3 in /home1/anaconda3/envs/my_env/bin/python)
frame #31: PyObject_Call + 0xb4 (0x55ca311b3b94 in /home1/anaconda3/envs/my_env/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x1cb8 (0x55ca3125dc58 in /home1/anaconda3/envs/my_env/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #34: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #35: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #37: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #38: _PyFunction_FastCallKeywords + 0x521 (0x55ca311d1c71 in /home1/anaconda3/envs/my_env/bin/python)
frame #39: + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #41: _PyEval_EvalCodeWithName + 0xdf9 (0x55ca311b2a29 in /home1/anaconda3/envs/my_env/bin/python)
frame #42: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x3f5 (0x55ca3125c395 in /home1/anaconda3/envs/my_env/bin/python)
frame #44: _PyFunction_FastCallKeywords + 0x187 (0x55ca311d18d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x3f5 (0x55ca3125c395 in /home1/anaconda3/envs/my_env/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #47: PyEval_EvalCode + 0x23 (0x55ca311b3273 in /home1/anaconda3/envs/my_env/bin/python)
frame #48: + 0x227c82 (0x55ca312c0c82 in /home1/anaconda3/envs/my_env/bin/python)
frame #49: PyRun_FileExFlags + 0x9e (0x55ca312cae1e in /home1/anaconda3/envs/my_env/bin/python)
frame #50: PyRun_SimpleFileExFlags + 0x1bb (0x55ca312cb00b in /home1/anaconda3/envs/my_env/bin/python)
frame #51: + 0x2330fa (0x55ca312cc0fa in /home1/anaconda3/envs/my_env/bin/python)
frame #52: _Py_UnixMain + 0x3c (0x55ca312cc18c in /home1/anaconda3/envs/my_env/bin/python)
frame #53: __libc_start_main + 0xf3 (0x1465549ef4a3 in /usr/lib64/libc.so.6)
frame #54: + 0x1d803a (0x55ca3127103a in /home1/anaconda3/envs/my_env/bin/python)

Traceback (most recent call last):
File “/work/08447/se0204/Transformer/main/novo/novo.py”, line 63, in <module>
main()
File “~/novo.py”, line 47, in main
train(train_data_path, val_data_path, model_path, config_path)
File “~/novo/denovo/train_test.py”, line 144, in train
trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1435, in _call_setup_hook
self.training_type_plugin.barrier(“pre_setup”)
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py”, line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File “/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1547458d27d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x1547458cee6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1145f2a (0x154746f2af2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x115106d (0x154746f3606d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x154746f3709f in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x154746f3cea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x154746f467ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x8301e5 (0x15479913e1e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1f6aa1 (0x154798b04aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #53: __libc_start_main + 0xf3 (0x1547a84634a3 in /usr/lib64/libc.so.6)

Thanks for the updated logs.
The error is raised by:

The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at Start Locally | PyTorch

and NCCL just reraises it.
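Your A100 GPU (compute capability sm_80) needs CUDA>=11.0, so the cu102 binaries you installed cannot run on it. Install the PyTorch binaries built with the CUDA 11.3 or 11.6 runtime instead.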
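
The mismatch described above can be checked before training starts. The following is only a minimal sketch: `parse_arch`, `is_supported`, and `check_local_gpus` are hypothetical helper names, and the matching rule (same major architecture with an equal or lower minor version, or PTX for an equal or older architecture) is a simplified assumption rather than the exact logic PyTorch uses for its warning. `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are the PyTorch calls it relies on.

```python
import re


def parse_arch(name):
    """Turn a string such as 'sm_70' or 'compute_80' into (kind, major, minor)."""
    kind, _, digits = name.partition("_")
    digits = re.sub(r"\D", "", digits)  # drop suffixes such as the 'a' in 'sm_90a'
    return kind, int(digits[:-1]), int(digits[-1])


def is_supported(capability, arch_list):
    """Return True if a GPU with `capability` (major, minor) can run a build
    that ships the architectures in `arch_list` (simplified rule, see lead-in)."""
    major, minor = capability
    for name in arch_list:
        kind, a_major, a_minor = parse_arch(name)
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # native binary for the same major architecture
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX that the driver can JIT-compile for a newer GPU
    return False


def check_local_gpus():
    """Print, for every visible GPU, whether the installed PyTorch build covers it."""
    import torch  # imported here so the helpers above work without a GPU

    arch_list = torch.cuda.get_arch_list()
    for idx in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(idx)
        name = torch.cuda.get_device_name(idx)
        status = "ok" if is_supported(cap, arch_list) else "NOT supported"
        print(f"GPU {idx}: {name} (sm_{cap[0]}{cap[1]}) -> {status}")
```

Calling `check_local_gpus()` in the affected environment should flag the A100, since the arch list quoted in the warning (sm_37 sm_50 sm_60 sm_70) contains no sm_8x entry.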


Hi @ptrblck, I have been stuck for the last few days on the error below, which appears when I run the script with torch.distributed.launch:

python3.8 -m torch.distributed.launch \
  --nnodes=1 \
  --node_rank=0 \
  --nproc_per_node=2 \
  train_stdec.py

Sorry for not being able to format the question properly. The error looks like this:

deep_lab:31220:31363 [1] enqueue.cc:215 NCCL WARN Cuda failure ‘the launch timed out and was terminated’
deep_lab:31220:31363 [1] NCCL INFO group.cc:282 -> 1
Traceback (most recent call last):
File “train_stdec.py”, line 743, in main
run(args, model_params, device, prediction_model)
File “train_stdec.py”, line 493, in run
h = train_epoch(args, model, prediction_model, optimizer, dataloaders['train'], device, epoch, history=h)
File “train_stdec.py”, line 368, in train_epoch
epoch_loss.backward(retain_graph=False)
File “/home/deep_lab/.local/lib/python3.8/site-packages/torch/tensor.py”, line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File “/home/deep_lab/.local/lib/python3.8/site-packages/torch/autograd/__init__.py”, line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:33, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Here is the output from NCCL_DEBUG=INFO

deep_lab:31220:31220 [1] NCCL INFO Bootstrap : Using [0]enp11s0:10.2.3.219<0>
deep_lab:31220:31220 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
deep_lab:31220:31220 [1] NCCL INFO NET/IB : No device found.
deep_lab:31220:31220 [1] NCCL INFO NET/Socket : Using [0]enp11s0:10.2.3.219<0>
deep_lab:31220:31220 [1] NCCL INFO Using network Socket
deep_lab:31219:31300 [0] NCCL INFO Channel 00/02 : 0 1
deep_lab:31220:31303 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
deep_lab:31219:31300 [0] NCCL INFO Channel 01/02 : 0 1
deep_lab:31220:31303 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
deep_lab:31219:31300 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
deep_lab:31219:31300 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
deep_lab:31219:31300 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via direct shared memory
deep_lab:31220:31303 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via direct shared memory
deep_lab:31219:31300 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via direct shared memory
deep_lab:31220:31303 [1] NCCL INFO Channel 01 : 1[2000] -> 0[1000] via direct shared memory
deep_lab:31219:31300 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
deep_lab:31219:31300 [0] NCCL INFO comm 0x7f33c0002e10 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
deep_lab:31219:31219 [0] NCCL INFO Launch mode Parallel
deep_lab:31220:31303 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
deep_lab:31220:31303 [1] NCCL INFO comm 0x7f2150002e10 rank 1 nranks 2 cudaDev 1 busId 2000 - Init COMPLETE

The output from CUDA_LAUNCH_BLOCKING=1 is as below:

terminate called after throwing an instance of ‘c10::Error’
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f226f5a92f2 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f226f5a667b in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f226f8011f9 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f226f5913a4 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f22e36d5cc9 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7f22e36cac8a in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f22e36f1f22 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f22e302ee76 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2121f (0x7f22e36f521f in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f80 (0x7f22e303df80 in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b1ee (0x7f22e303f1ee in /home/deep_lab/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3.8() [0x5d28f4]
frame #12: /usr/bin/python3.8() [0x5a729d]
frame #13: /usr/bin/python3.8() [0x5ec780]
frame #14: /usr/bin/python3.8() [0x5ec84a]
frame #15: /usr/bin/python3.8() [0x5ec84a]
frame #16: /usr/bin/python3.8() [0x5ec84a]
frame #17: /usr/bin/python3.8() [0x5ec84a]
frame #18: /usr/bin/python3.8() [0x5ec84a]
frame #19: /usr/bin/python3.8() [0x5ec84a]
frame #20: /usr/bin/python3.8() [0x5ec84a]
frame #21: /usr/bin/python3.8() [0x5ec84a]
frame #22: /usr/bin/python3.8() [0x5441f8]
frame #23: /usr/bin/python3.8() [0x5ef9e6]
frame #24: /usr/bin/python3.8() [0x6af687]
frame #25: /usr/bin/python3.8() [0x5ef9df]
frame #26: /usr/bin/python3.8() [0x6af687]
frame #27: _PyModule_ClearDict + 0xe0c (0x5c6aac in /usr/bin/python3.8)
frame #28: PyImport_Cleanup + 0x2be (0x68485e in /usr/bin/python3.8)
frame #29: Py_FinalizeEx + 0x7f (0x67f8af in /usr/bin/python3.8)
frame #30: Py_RunMain + 0x32d (0x6b70fd in /usr/bin/python3.8)
frame #31: Py_BytesMain + 0x2d (0x6b736d in /usr/bin/python3.8)
frame #32: __libc_start_main + 0xf3 (0x7f22e9216083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #33: _start + 0x2e (0x5fa5ce in /usr/bin/python3.8)
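
For anyone who wants to reproduce the two debug runs above, here is a minimal sketch of one way to set both variables from a small Python wrapper. `debug_env` and `launch_command` are hypothetical helper names introduced for illustration. The launcher flags are the ones from the command in this post.

```python
import os
import sys


def debug_env(base=None):
    """Copy an environment mapping and switch on the two debug variables."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # run CUDA kernels synchronously
    return env


def launch_command(script, nproc_per_node=2, python=sys.executable):
    """Build the single-node torch.distributed.launch command line."""
    return [
        python, "-m", "torch.distributed.launch",
        "--nnodes=1",
        "--node_rank=0",
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]
```

The run itself would then be `subprocess.run(launch_command("train_stdec.py"), env=debug_env())`. Setting the variables in the shell before the command has the same effect.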

The output from python -m torch.utils.collect_env looks like this:

Collecting environment information…
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX TITAN X
GPU 1: NVIDIA GeForce GTX TITAN X

Nvidia driver version: 465.19.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] numpydoc==1.4.0
[pip3] torch==1.8.1+cu111
[pip3] torchaudio==0.8.1
[pip3] torchvision==0.9.1+cu111
[conda] Could not collect

This error:

 CUDA error: the launch timed out and was terminated

is raised if the OS kills the CUDA kernel because it ran for too long.
This usually happens when the GPU is also used for video output (e.g. it drives an X server) and the OS wants to keep the display from lagging.
As a workaround, you could stop the X server and rerun the workload from a text-only terminal to see if it works.
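
To see which GPU is driving a display before stopping the X server, a small sketch along the following lines might help. It assumes `nvidia-smi` is on the PATH and supports the `display_active` and `display_mode` query fields. `gpus_driving_display` and `query_nvidia_smi` are hypothetical helper names, and the parsing assumes the `csv,noheader` output format with "Enabled"/"Disabled" values.

```python
import csv
import io
import subprocess

# Query used below; the field names are nvidia-smi --query-gpu properties.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,display_active,display_mode",
    "--format=csv,noheader",
]


def gpus_driving_display(csv_text):
    """Return the indices of GPUs whose display_active field is 'Enabled'."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    active = []
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        index, _name, display_active, _display_mode = (f.strip() for f in row[:4])
        if display_active == "Enabled":
            active.append(int(index))
    return active


def query_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (needs the NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

On the affected machine, `gpus_driving_display(query_nvidia_smi())` would list the GPUs with an active display. If only one of the two GPUs shows up, a quick test run restricted to the other one via `CUDA_VISIBLE_DEVICES` could help confirm the diagnosis.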