Hello, everyone. I have been trying to use DDP to train a transformer. Everything runs well on our own cluster, but after I transferred the code and the environment to a server rented from an outside company, an error occurs at torch.nn.parallel.DistributedDataParallel(model).
Here are the code and the error output with NCCL_DEBUG=INFO.
Could you please tell me what I should do to fix it, or what I should report to the server provider?
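import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument('--local_rank', default=-1, type=int)
args = parser.parse_args()
print(args.local_rank)
# Rendezvous settings (MASTER_ADDR, RANK, ...) are read from the environment
torch.distributed.init_process_group(backend='nccl', init_method='env://')
device = torch.device('cuda', args.local_rank)
# The two lines below are an excerpt from the paral_train method of the model wrapper class
self.BartNN = self.BartNN.to(device)
self.BartNN = torch.nn.parallel.DistributedDataParallel(self.BartNN)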
python -m torch.distributed.launch --nproc_per_node=4 main.py
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO Bootstrap : Using eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.14.3+cuda11.6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE [1]mlx5_88:1/RoCE [RO]; OOB eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Using network IB
sts-yymx4-wucs-0:1479:1479 [2] NCCL INFO cudaDriverVersion 11060
sts-yymx4-wucs-0:1479:1479 [2] NCCL INFO Bootstrap : Using eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1479:1479 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE [1]mlx5_88:1/RoCE [RO]; OOB eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Using network IB
sts-yymx4-wucs-0:1480:1480 [3] NCCL INFO cudaDriverVersion 11060
sts-yymx4-wucs-0:1480:1480 [3] NCCL INFO Bootstrap : Using eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1480:1480 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE [1]mlx5_88:1/RoCE [RO]; OOB eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO Using network IB
sts-yymx4-wucs-0:1478:1478 [1] NCCL INFO cudaDriverVersion 11060
sts-yymx4-wucs-0:1478:1478 [1] NCCL INFO Bootstrap : Using eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1478:1478 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE [1]mlx5_88:1/RoCE [RO]; OOB eth0:10.236.31.187<0>
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO Using network IB
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,ffff0000
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,ffff0000
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 00/02 : 0 1 2 3
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 01/02 : 0 1 2 3
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 00/0 : 3[b9000] -> 0[41000] [receive] via NET/IB/1
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Channel 00/0 : 1[45000] -> 2[b5000] [receive] via NET/IB/0
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO Channel 00/0 : 1[45000] -> 2[b5000] [send] via NET/IB/1
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO Channel 00/0 : 3[b9000] -> 0[41000] [send] via NET/IB/0
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 01/0 : 3[b9000] -> 0[41000] [receive] via NET/IB/1
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Channel 01/0 : 1[45000] -> 2[b5000] [receive] via NET/IB/0
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 00/0 : 0[41000] -> 1[45000] via P2P/IPC
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Channel 00/0 : 2[b5000] -> 3[b9000] via P2P/IPC
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO Channel 01/0 : 0[41000] -> 1[45000] via P2P/IPC
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO Channel 01/0 : 2[b5000] -> 3[b9000] via P2P/IPC
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO Channel 01/0 : 1[45000] -> 2[b5000] [send] via NET/IB/1
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO Channel 01/0 : 3[b9000] -> 0[41000] [send] via NET/IB/0
sts-yymx4-wucs-0:1478:1531 [1] misc/ibvwrap.cc:262 NCCL WARN Call to ibv_reg_mr failed with error Cannot allocate memory
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO transport/net_ib.cc:632 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO include/net.h:26 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO transport/net.cc:517 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO proxy.cc:991 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO proxy.cc:1019 -> 2
sts-yymx4-wucs-0:1478:1531 [1] proxy.cc:1119 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2
sts-yymx4-wucs-0:1478:1531 [1] misc/ibvwrap.cc:299 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO transport/net_ib.cc:485 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO transport/net_ib.cc:615 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO include/net.h:26 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO transport/net.cc:517 -> 2
sts-yymx4-wucs-0:1478:1531 [1] NCCL INFO proxy.cc:991 -> 2
sts-yymx4-wucs-0:1478:1531 [1] proxy.cc:1119 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2
sts-yymx4-wucs-0:1478:1526 [1] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<40037>
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO misc/socket.cc:546 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO misc/socket.cc:558 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO proxy.cc:881 -> 6
sts-yymx4-wucs-0:1478:1526 [1] proxy.cc:884 NCCL WARN Proxy Call to rank 1 failed (Connect)
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO transport/net.cc:265 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO transport.cc:124 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO init.cc:790 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO init.cc:1089 -> 6
sts-yymx4-wucs-0:1478:1526 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
sts-yymx4-wucs-0:1478:1478 [1] NCCL INFO group.cc:421 -> 3
sts-yymx4-wucs-0:1478:1478 [1] NCCL INFO group.cc:106 -> 3
sts-yymx4-wucs-0:1480:1530 [3] misc/ibvwrap.cc:262 NCCL WARN Call to ibv_reg_mr failed with error Cannot allocate memory
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO transport/net_ib.cc:632 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO include/net.h:26 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO transport/net.cc:517 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO proxy.cc:991 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO proxy.cc:1019 -> 2
sts-yymx4-wucs-0:1480:1530 [3] proxy.cc:1119 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 2
sts-yymx4-wucs-0:1480:1530 [3] misc/ibvwrap.cc:299 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO transport/net_ib.cc:485 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO transport/net_ib.cc:615 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO include/net.h:26 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO transport/net.cc:517 -> 2
sts-yymx4-wucs-0:1480:1530 [3] NCCL INFO proxy.cc:991 -> 2
sts-yymx4-wucs-0:1480:1530 [3] proxy.cc:1119 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 2
sts-yymx4-wucs-0:1480:1523 [3] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<57659>
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO misc/socket.cc:546 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO misc/socket.cc:558 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO proxy.cc:881 -> 6
sts-yymx4-wucs-0:1480:1523 [3] proxy.cc:884 NCCL WARN Proxy Call to rank 3 failed (Connect)
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO transport/net.cc:265 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO transport.cc:124 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO init.cc:790 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO init.cc:1089 -> 6
sts-yymx4-wucs-0:1480:1523 [3] NCCL INFO group.cc:64 -> 6 [Async thread]
sts-yymx4-wucs-0:1480:1480 [3] NCCL INFO group.cc:421 -> 3
sts-yymx4-wucs-0:1480:1480 [3] NCCL INFO group.cc:106 -> 3
sts-yymx4-wucs-0:1478:1478 [0] NCCL INFO comm 0x18941730 rank 1 nranks 4 cudaDev 1 busId 45000 - Abort COMPLETE
Traceback (most recent call last):
File "main.py", line 8, in <module>
c.paral_train(stringlist)
File "/data/data/ChemBart/ChemBart.py", line 72, in paral_train
self.BartNN = torch.nn.parallel.DistributedDataParallel(self.BartNN)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 1 failed (Connect)
sts-yymx4-wucs-0:1480:1480 [0] NCCL INFO comm 0x5ca13c00 rank 3 nranks 4 cudaDev 3 busId b9000 - Abort COMPLETE
Traceback (most recent call last):
File "main.py", line 8, in <module>
c.paral_train(stringlist)
File "/data/data/ChemBart/ChemBart.py", line 72, in paral_train
self.BartNN = torch.nn.parallel.DistributedDataParallel(self.BartNN)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 3 failed (Connect)
sts-yymx4-wucs-0:1477:1532 [0] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<60726>
sts-yymx4-wucs-0:1477:1532 [0] NCCL INFO transport/net_ib.cc:698 -> 6
sts-yymx4-wucs-0:1477:1532 [0] NCCL INFO include/net.h:27 -> 6
sts-yymx4-wucs-0:1477:1532 [0] NCCL INFO transport/net.cc:651 -> 6
sts-yymx4-wucs-0:1477:1532 [0] NCCL INFO proxy.cc:991 -> 6
sts-yymx4-wucs-0:1477:1532 [0] proxy.cc:1119 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 6
sts-yymx4-wucs-0:1477:1515 [0] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<60655>
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO misc/socket.cc:546 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO misc/socket.cc:558 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO proxy.cc:881 -> 6
sts-yymx4-wucs-0:1477:1515 [0] proxy.cc:884 NCCL WARN Proxy Call to rank 0 failed (Connect)
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO transport/net.cc:315 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO transport.cc:134 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO init.cc:790 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO init.cc:1089 -> 6
sts-yymx4-wucs-0:1477:1515 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO group.cc:421 -> 3
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO group.cc:106 -> 3
sts-yymx4-wucs-0:1477:1477 [0] NCCL INFO comm 0x6e32680 rank 0 nranks 4 cudaDev 0 busId 41000 - Abort COMPLETE
Traceback (most recent call last):
File "main.py", line 8, in <module>
c.paral_train(stringlist)
File "/data/data/ChemBart/ChemBart.py", line 72, in paral_train
self.BartNN = torch.nn.parallel.DistributedDataParallel(self.BartNN)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)
sts-yymx4-wucs-0:1479:1529 [2] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<49118>
sts-yymx4-wucs-0:1479:1529 [2] NCCL INFO transport/net_ib.cc:698 -> 6
sts-yymx4-wucs-0:1479:1529 [2] NCCL INFO include/net.h:27 -> 6
sts-yymx4-wucs-0:1479:1529 [2] NCCL INFO transport/net.cc:651 -> 6
sts-yymx4-wucs-0:1479:1529 [2] NCCL INFO proxy.cc:991 -> 6
sts-yymx4-wucs-0:1479:1529 [2] proxy.cc:1119 NCCL WARN [Proxy Service 2] Failed to execute operation Connect from rank 2, retcode 6
sts-yymx4-wucs-0:1479:1520 [2] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local<47727>
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO misc/socket.cc:546 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO misc/socket.cc:558 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO proxy.cc:881 -> 6
sts-yymx4-wucs-0:1479:1520 [2] proxy.cc:884 NCCL WARN Proxy Call to rank 2 failed (Connect)
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO transport/net.cc:315 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO transport.cc:134 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO init.cc:790 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO init.cc:1089 -> 6
sts-yymx4-wucs-0:1479:1520 [2] NCCL INFO group.cc:64 -> 6 [Async thread]
sts-yymx4-wucs-0:1479:1479 [2] NCCL INFO group.cc:421 -> 3
sts-yymx4-wucs-0:1479:1479 [2] NCCL INFO group.cc:106 -> 3
sts-yymx4-wucs-0:1479:1479 [0] NCCL INFO comm 0x17754240 rank 2 nranks 4 cudaDev 2 busId b5000 - Abort COMPLETE
Traceback (most recent call last):
File "main.py", line 8, in <module>
c.paral_train(stringlist)
File "/data/data/ChemBart/ChemBart.py", line 72, in paral_train
self.BartNN = torch.nn.parallel.DistributedDataParallel(self.BartNN)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 2 failed (Connect)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1477) of binary: /data/data/anaconda3/envs/pytorch/bin/python
Traceback (most recent call last):
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-11-13_12:12:14
host : sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1478)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-11-13_12:12:14
host : sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1479)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-11-13_12:12:14
host : sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1480)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-13_12:12:14
host : sts-yymx4-wucs-0.sts-yymx4-wucs.kym.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1477)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
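
The first warnings in the log above are `ibv_reg_mr` and `ibv_create_cq` failing with "Cannot allocate memory" on ranks 1 and 3; the later "Connection closed by remote peer" messages on ranks 0 and 2 appear to follow from those. One possible cause (an assumption, the log does not confirm it) is a low locked-memory limit inside the rented container, since RDMA memory registration pins memory. A minimal sketch for checking that limit from the same Python environment; the helper names `format_limit` and `memlock_report` are hypothetical and not part of PyTorch or NCCL:

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value in KiB; RLIM_INFINITY means no cap."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_report() -> str:
    """Describe the locked-memory limit of the current process (getrlimit reports bytes)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"RLIMIT_MEMLOCK soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    # Run this on the rented server, inside the same container as the training job.
    print(memlock_report())
```

If the reported limit is small rather than unlimited, that value is probably worth including in the report to the server provider. As a separate diagnostic, launching once with the NCCL environment variable `NCCL_IB_DISABLE=1` should show whether training starts when the IB/RoCE transport is bypassed; that would be a way to narrow the problem down, not necessarily a fix.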