My training job hangs, and the stack trace below shows it is stuck in torch.cuda.synchronize(), called from the DeepSpeed timer at the start of the engine's forward pass. Any idea what is going on?
Environment: PyTorch 1.9.0, CUDA 11.1
(gdb) py-bt
Traceback (most recent call first):
  <built-in method _cuda_synchronize of module object at remote 0x7f0b8bc0d360>
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 446, in synchronize
    return torch._C._cuda_synchronize()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/timer.py", line 165, in start
    torch.cuda.synchronize()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2372, in forward
    load_optimizer_states=True,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1563, in _call_impl
    >>> net = nn.Sequential(l, l)
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1531, in inner_loop_step
    loss_dict = self.model_engine(self.dict_data)
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1672, in inner_loop
    losses /= self.gradient_accumulate
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1758, in do
    self.optimizer.zero_grad()
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 1743, in do_train
    port=self.cfg.dist_url_tcp_port,
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 4443, in train
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 1540, in ensure_train
    arguments={'iteration': start_iter},
  File "/tmp/code/act/src/qd/pipeline.py", line 679, in pipeline_train_eval_multi
    pip.ensure_train()
  File "src/qd/qd_common.py", line 3349, in execute_func
  File "src/qd/qd_common.py", line 4338, in <module>
(gdb)
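
To get the same kind of trace from every rank without attaching gdb by hand, a watchdog around the training step might help. This is only a minimal sketch using Python's standard faulthandler module; the name hang_watchdog and the timeout value are hypothetical and not part of PyTorch or DeepSpeed. It shows where each process is stuck, not why.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump the Python stack of all threads if the body runs longer than timeout_s.

    hang_watchdog is a hypothetical helper name; only the faulthandler calls
    are standard-library API.
    """
    if file is None:
        file = sys.stderr
    # Arms a background timer; the dump is written directly to the file descriptor,
    # so it still works while the main thread is blocked inside a C call.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the step finishes in time.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage inside the training loop (600 s is an arbitrary example value):
# with hang_watchdog(600):
#     loss_dict = self.model_engine(self.dict_data)
```

Running this on every rank and comparing the dumps could show whether all processes are waiting in the same call or whether one rank has diverged from the others.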