initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO: Added key: store_based_barrier_key:1 to store for rank: 1
INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py:510: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "~/novo/novo.py", line 63, in <module>
main()
File "~/novo/novo.py", line 47, in main
train(train_data_path, val_data_path, model_path, config_path)
File "/work/08447/se0204/Transformer/main/novo/denovo/train_test.py", line 144, in train
trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1435, in _call_setup_hook
self.training_type_plugin.barrier("pre_setup")
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1464f1e5e7d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x1464f1e5ae6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1145f2a (0x1464f34b6f2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x115106d (0x1464f34c206d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x1464f34c309f in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x1464f34c8ea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x1464f34d27ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x8301e5 (0x1465456ca1e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1f6aa1 (0x146545090aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: _PyMethodDef_RawFastCallKeywords + 0x2ec (0x55ca311e283c in /home1/anaconda3/envs/my_env/bin/python)
frame #10: _PyObject_FastCallKeywords + 0x130 (0x55ca31218140 in /home1/anaconda3/envs/my_env/bin/python)
frame #11: <unknown function> + 0x17fbd1 (0x55ca31218bd1 in /home1/anaconda3/envs/my_env/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #14: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #15: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #17: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #19: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #21: _PyFunction_FastCallKeywords + 0x187 (0x55ca311d18d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #22: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #25: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #26: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1401 (0x55ca3125d3a1 in /home1/anaconda3/envs/my_env/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #29: _PyObject_FastCallDict + 0x312 (0x55ca311b3592 in /home1/anaconda3/envs/my_env/bin/python)
frame #30: <unknown function> + 0x12f1c3 (0x55ca311c81c3 in /home1/anaconda3/envs/my_env/bin/python)
frame #31: PyObject_Call + 0xb4 (0x55ca311b3b94 in /home1/anaconda3/envs/my_env/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x1cb8 (0x55ca3125dc58 in /home1/anaconda3/envs/my_env/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #34: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #35: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #37: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #38: _PyFunction_FastCallKeywords + 0x521 (0x55ca311d1c71 in /home1/anaconda3/envs/my_env/bin/python)
frame #39: <unknown function> + 0x17f9c5 (0x55ca312189c5 in /home1/anaconda3/envs/my_env/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x661 (0x55ca3125c601 in /home1/anaconda3/envs/my_env/bin/python)
frame #41: _PyEval_EvalCodeWithName + 0xdf9 (0x55ca311b2a29 in /home1/anaconda3/envs/my_env/bin/python)
frame #42: _PyFunction_FastCallKeywords + 0x583 (0x55ca311d1cd3 in /home1/anaconda3/envs/my_env/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x3f5 (0x55ca3125c395 in /home1/anaconda3/envs/my_env/bin/python)
frame #44: _PyFunction_FastCallKeywords + 0x187 (0x55ca311d18d7 in /home1/anaconda3/envs/my_env/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x3f5 (0x55ca3125c395 in /home1/anaconda3/envs/my_env/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x255 (0x55ca311b1e85 in /home1/anaconda3/envs/my_env/bin/python)
frame #47: PyEval_EvalCode + 0x23 (0x55ca311b3273 in /home1/anaconda3/envs/my_env/bin/python)
frame #48: <unknown function> + 0x227c82 (0x55ca312c0c82 in /home1/anaconda3/envs/my_env/bin/python)
frame #49: PyRun_FileExFlags + 0x9e (0x55ca312cae1e in /home1/anaconda3/envs/my_env/bin/python)
frame #50: PyRun_SimpleFileExFlags + 0x1bb (0x55ca312cb00b in /home1/anaconda3/envs/my_env/bin/python)
frame #51: <unknown function> + 0x2330fa (0x55ca312cc0fa in /home1/anaconda3/envs/my_env/bin/python)
frame #52: _Py_UnixMain + 0x3c (0x55ca312cc18c in /home1/anaconda3/envs/my_env/bin/python)
frame #53: __libc_start_main + 0xf3 (0x1465549ef4a3 in /usr/lib64/libc.so.6)
frame #54: <unknown function> + 0x1d803a (0x55ca3127103a in /home1/anaconda3/envs/my_env/bin/python)
Traceback (most recent call last):
File "/work/08447/se0204/Transformer/main/novo/novo.py", line 63, in <module>
main()
File "~/novo.py", line 47, in main
train(train_data_path, val_data_path, model_path, config_path)
File "~/novo/denovo/train_test.py", line 144, in train
trainer.fit(model, train_loader.train_dataloader(), val_loader.val_dataloader())
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1138, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1435, in _call_setup_hook
self.training_type_plugin.barrier("pre_setup")
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 403, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File "/home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ~AutoNcclGroup at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1547458d27d2 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x1547458cee6b in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1145f2a (0x154746f2af2a in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x115106d (0x154746f3606d in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0xf (0x154746f3709f in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x2d3 (0x154746f3cea3 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x72a (0x154746f467ba in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x8301e5 (0x15479913e1e5 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1f6aa1 (0x154798b04aa1 in /home1/anaconda3/envs/my_env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #53: __libc_start_main + 0xf3 (0x1547a84634a3 in /usr/lib64/libc.so.6)
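
The UserWarning in the log says the GPU reports capability sm_80 while the installed PyTorch build lists only sm_37 sm_50 sm_60 sm_70. That mismatch may be what surfaces as the unhandled CUDA error at the first NCCL barrier, though the log alone does not prove it. Below is a minimal sketch of the comparison, using the values from the warning. The helper name `is_capability_supported` is hypothetical, and the rule it applies (same major version, build minor not above device minor, PTX `compute_XX` entries ignored) is a simplifying assumption rather than PyTorch's exact check.

```python
def is_capability_supported(capability, arch_list):
    """Return True if a compiled sm_XY entry can run on the given device.

    Assumption (simplified): a binary built for sm_XY runs on a device with
    the same major version X and a minor version >= Y. PTX entries such as
    "compute_80" are skipped here.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_") or not arch[3:].isdigit():
            continue  # skip PTX or suffixed entries in this sketch
        sm = int(arch[3:])
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


# Values copied from the warning in the log above.
log_arch_list = "sm_37 sm_50 sm_60 sm_70".split()
a100_capability = (8, 0)  # sm_80

print(is_capability_supported(a100_capability, log_arch_list))

# With PyTorch installed on the GPU node, the live values can be read with:
#   torch.cuda.get_device_capability(0)
#   torch.cuda.get_arch_list()
```

If the check comes back negative on the actual machine, installing a PyTorch build compiled against a CUDA version that includes sm_80 (see the install page referenced in the warning) is the usual next step to try.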