Torchrun vLLM error: TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host

I have two GPU servers, and I want to use torchrun to run vLLM across both of them.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --rdzv-backend=c10d --rdzv-endpoint=192.168.0.13:29400 start_vllm.py --config /root/config.json
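
To see which rank and rendezvous values each process receives from a launch like this, a small script could be run in place of start_vllm.py. This is only a sketch: it assumes the standard environment variables that torchrun exports to its workers (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and the helper name torchrun_env is hypothetical.

```python
import os

# Environment variables that torchrun sets for every worker it launches.
TORCHRUN_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def torchrun_env(environ=os.environ) -> dict:
    """Collect the torchrun-provided variables (None if a variable is unset)."""
    return {name: environ.get(name) for name in TORCHRUN_VARS}


if __name__ == "__main__":
    # Each worker prints its own view, so the output can be compared across nodes.
    print(torchrun_env())
```

Comparing the printed MASTER_ADDR and MASTER_PORT on both nodes with the address in the timeout message may show whether the failing connection is the torchrun rendezvous or a separate one opened later.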

It stops here and appears to hang:
(VllmWorkerProcess pid=1968773) INFO 03-02 19:51:52 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=1968801) INFO 03-02 19:51:52 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=1968800) INFO 03-02 19:51:52 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=1968791) (VllmWorkerProcess pid=1968799) INFO 03-02 19:51:52 cuda.py:160] Using Triton MLA backend.
INFO 03-02 19:51:52 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=1968783) (VllmWorkerProcess pid=1968790) WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
(VllmWorkerProcess pid=1968773) WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
(VllmWorkerProcess pid=1968801) WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
(VllmWorkerProcess pid=1968800) WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
(VllmWorkerProcess pid=1968799) (VllmWorkerProcess pid=1968791) WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.
WARNING 03-02 19:51:52 triton_decode_attention.py:44] The following error message ‘operation scheduled before its operands’ can be ignored.

Then, after waiting about 10 minutes, errors like this appear:
[E302 20:01:08.812651775 socket.cpp:1011] [c10d] The client socket has timed out after 600000ms while trying to connect to (192.168.0.14, 54273).
[W302 20:01:08.813334394 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 192.168.0.14:54273 - retrying (try=0, timeout=600000ms, delay=66616ms): The client socket has timed out after 600000ms while trying to connect to (192.168.0.14, 54273).
Exception raised from throwTimeoutError at …/torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7cc9f6c446 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x15e04c6 (0x7f7cb514d4c6 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x6029d95 (0x7f7cb9b96d95 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x6029f36 (0x7f7cb9b96f36 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: + 0x602a3a4 (0x7f7cb9b973a4 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x5fe8016 (0x7f7cb9b55016 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x7f7cb9b57f7c in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0xd9acdd (0x7f7cc953acdd in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x4cb474 (0x7f7cc8c6b474 in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x15c31e (0x55e128ee731e in /root/deepseek-env/bin/python3)
frame #10: _PyObject_MakeTpCall + 0x25b (0x55e128edde4b in /root/deepseek-env/bin/python3)
frame #11: + 0x16a9f0 (0x55e128ef59f0 in /root/deepseek-env/bin/python3)
frame #12: + 0x166de7 (0x55e128ef1de7 in /root/deepseek-env/bin/python3)
frame #13: + 0x1531fb (0x55e128ede1fb in /root/deepseek-env/bin/python3)
frame #14: + 0x4c9ccb (0x7f7cc8c69ccb in /root/deepseek-env/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: _PyObject_MakeTpCall + 0x25b (0x55e128edde4b in /root/deepseek-env/bin/python3)
frame #16: _PyEval_EvalFrameDefault + 0x6542 (0x55e128ed64f2 in /root/deepseek-env/bin/python3)
frame #17: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #18: _PyEval_EvalFrameDefault + 0x6c5 (0x55e128ed0675 in /root/deepseek-env/bin/python3)
frame #19: + 0x2030a5 (0x55e128f8e0a5 in /root/deepseek-env/bin/python3)
frame #20: + 0x15cdc9 (0x55e128ee7dc9 in /root/deepseek-env/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x6c5 (0x55e128ed0675 in /root/deepseek-env/bin/python3)
frame #22: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #23: PyObject_Call + 0x122 (0x55e128ef6262 in /root/deepseek-env/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x289f (0x55e128ed284f in /root/deepseek-env/bin/python3)
frame #25: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #26: PyObject_Call + 0x122 (0x55e128ef6262 in /root/deepseek-env/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x289f (0x55e128ed284f in /root/deepseek-env/bin/python3)
frame #28: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #29: _PyEval_EvalFrameDefault + 0x19b6 (0x55e128ed1966 in /root/deepseek-env/bin/python3)
frame #30: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x6c5 (0x55e128ed0675 in /root/deepseek-env/bin/python3)
frame #32: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x6c5 (0x55e128ed0675 in /root/deepseek-env/bin/python3)
frame #34: + 0x16a821 (0x55e128ef5821 in /root/deepseek-env/bin/python3)
frame #35: _PyEval_EvalFrameDefault + 0x289f (0x55e128ed284f in /root/deepseek-env/bin/python3)
frame #36: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #37: _PyEval_EvalFrameDefault + 0x6c5 (0x55e128ed0675 in /root/deepseek-env/bin/python3)
frame #38: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #39: PyObject_Call + 0x122 (0x55e128ef6262 in /root/deepseek-env/bin/python3)
frame #40: _PyEval_EvalFrameDefault + 0x289f (0x55e128ed284f in /root/deepseek-env/bin/python3)
frame #41: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #42: _PyEval_EvalFrameDefault + 0x8b6 (0x55e128ed0866 in /root/deepseek-env/bin/python3)
frame #43: + 0x16a5c1 (0x55e128ef55c1 in /root/deepseek-env/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x19b6 (0x55e128ed1966 in /root/deepseek-env/bin/python3)
frame #45: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #46: _PyEval_EvalFrameDefault + 0x8b6 (0x55e128ed0866 in /root/deepseek-env/bin/python3)
frame #47: _PyObject_FastCallDictTstate + 0xc4 (0x55e128edcfd4 in /root/deepseek-env/bin/python3)
frame #48: + 0x1667c4 (0x55e128ef17c4 in /root/deepseek-env/bin/python3)
frame #49: _PyObject_MakeTpCall + 0x1fc (0x55e128edddec in /root/deepseek-env/bin/python3)
frame #50: _PyEval_EvalFrameDefault + 0x6542 (0x55e128ed64f2 in /root/deepseek-env/bin/python3)
frame #51: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #52: _PyEval_EvalFrameDefault + 0x61a2 (0x55e128ed6152 in /root/deepseek-env/bin/python3)
frame #53: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x8b6 (0x55e128ed0866 in /root/deepseek-env/bin/python3)
frame #55: _PyObject_FastCallDictTstate + 0xc4 (0x55e128edcfd4 in /root/deepseek-env/bin/python3)
frame #56: + 0x1667c4 (0x55e128ef17c4 in /root/deepseek-env/bin/python3)
frame #57: _PyObject_MakeTpCall + 0x1fc (0x55e128edddec in /root/deepseek-env/bin/python3)
frame #58: _PyEval_EvalFrameDefault + 0x6542 (0x55e128ed64f2 in /root/deepseek-env/bin/python3)
frame #59: _PyFunction_Vectorcall + 0x7c (0x55e128ee7b6c in /root/deepseek-env/bin/python3)
frame #60: _PyEval_EvalFrameDefault + 0x8b6 (0x55e128ed0866 in /root/deepseek-env/bin/python3)
frame #61: + 0x16a5c1 (0x55e128ef55c1 in /root/deepseek-env/bin/python3)
frame #62: PyObject_Call + 0x122 (0x55e128ef6262 in /root/deepseek-env/bin/python3)

This error recurs more than 50 times, and then:

Traceback (most recent call last):
  File "/root/start_vllm.py", line 54, in <module>
    main()
  File "/root/start_vllm.py", line 30, in main
    llm = LLM(
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/utils.py", line 1022, in inner
    return fn(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 242, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args
    engine = cls(
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 271, in __init__
    super().__init__(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 124, in _init_executor
    self._run_workers("init_device")
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
    init_worker_distributed_environment(self.vllm_config, self.rank,
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/worker/worker.py", line 504, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/root/deepseek-env/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 819, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (192.168.0.14, 54779).
ERROR 03-02 20:11:30 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 2019556 died, exit code: -15
INFO 03-02 20:11:30 multiproc_worker_utils.py:128] Killing local vLLM worker processes
W0302 20:11:31.454276 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019182 closing signal SIGTERM
W0302 20:11:31.455976 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019183 closing signal SIGTERM
W0302 20:11:31.457252 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019184 closing signal SIGTERM
W0302 20:11:31.458984 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019185 closing signal SIGTERM
W0302 20:11:31.460362 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019186 closing signal SIGTERM
W0302 20:11:31.461746 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019187 closing signal SIGTERM
W0302 20:11:31.463161 2019108 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2019188 closing signal SIGTERM
E0302 20:11:37.014529 2019108 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2019181) of binary: /root/deepseek-env/bin/python3
Traceback (most recent call last):
  File "/root/deepseek-env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/deepseek-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

start_vllm.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-03-02_20:11:31
host : h20-2
rank : 8 (local_rank: 0)
exitcode : 1 (pid: 2019181)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.6 documentation
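
Since every failure is a client socket timing out against 192.168.0.14 on a high port, one thing that can be checked independently of vLLM is whether that address accepts TCP connections at all from the other node. A minimal sketch, using only Python's standard socket module; the helper name can_connect is hypothetical, and the port has to be read from the current run's message because it differs between runs (54273 and 54779 in the logs above).

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout_s."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Example call while the job is hanging (port taken from that run's log):
#   can_connect("192.168.0.14", 54273)
```

A False result while the job is still hanging would point toward a firewall, routing, or interface-selection problem rather than something inside vLLM, though this check alone cannot tell those apart.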

Can someone help? Thanks!


I encountered a similar issue. What’s your vllm/ray version?

vllm 0.7.3
ray 2.40.0