Socket error - broken pipe during rendezvous

Hi folks, we’re encountering a socket error during rendezvous when running a job with 256 nodes.

We don’t see this issue when running with 128 nodes.

Any thoughts on what might be happening here? We don’t see any indication of hitting a limit so far. We’ve already tried increasing somaxconn and tcp_max_syn_backlog to 65535:

$ cat /proc/sys/net/core/somaxconn
65535

$ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
65535 
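
For anyone reproducing this, here is a sketch of how those values can be raised and persisted (standard sysctl usage; the drop-in file name is arbitrary):

# runtime change (assumes root); values mirror what we verified above
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# persist across reboots via a drop-in file
printf 'net.core.somaxconn=65535\nnet.ipv4.tcp_max_syn_backlog=65535\n' | sudo tee /etc/sysctl.d/99-rendezvous-backlog.conf
sudo sysctl --system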

Error log

Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c89ad315e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5972b5e (0x14c8f7096b5e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5974130 (0x14c8f7098130 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x597487d (0x14c8f709887d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5975509 (0x14c8f7099509 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x1fb (0x14c8f709352b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xc0d379 (0x14c8ff418379 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x37e19d (0x14c8feb8919d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: /usr/bin/python() [0x58208f]
frame #9: _PyObject_MakeTpCall + 0x75 (0x549185 in /usr/bin/python)
frame #10: /usr/bin/python() [0x54cea7]
frame #11: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #12: /usr/bin/python() [0x54cd32]
frame #13: /usr/bin/python() [0x6f826c]
frame #14: /usr/bin/python() [0x6b917c]
frame #15: <unknown function> + 0x9caa4 (0x14c90b4efaa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x129c3c (0x14c90b57cc3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0530 16:59:36.546000 2277043 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1341] The node 'fs-mbz-gpu-377_2277043_0' has failed to send a keep-alive heartbeat to the rendezvous '4d06fc05-8e9d-465b-8e12-d9298c49d04f' due to an error of type RendezvousConnectionError.
[W530 16:59:36.633728218 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[fs-mbz-gpu-377]:34610, remote=[fs-mbz-gpu-014]:29502): Broken pipe
Exception raised from sendBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c89ad315e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5972b5e (0x14c8f7096b5e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59743b8 (0x14c8f70983b8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5975b9e (0x14c8f7099b9e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x348 (0x14c8f7093678 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc0d379 (0x14c8ff418379 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x37e19d (0x14c8feb8919d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python() [0x58208f]
frame #8: _PyObject_MakeTpCall + 0x75 (0x549185 in /usr/bin/python)
frame #9: /usr/bin/python() [0x54cea7]
frame #10: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #11: _PyObject_Call_Prepend + 0xc2 (0x54a9d2 in /usr/bin/python)
frame #12: /usr/bin/python() [0x5a3628]
frame #13: PyObject_Call + 0x6c (0x54b30c in /usr/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #15: PyEval_EvalCode + 0x15b (0x5d58eb in /usr/bin/python)
frame #16: /usr/bin/python() [0x608b42]
frame #17: /usr/bin/python() [0x6b4e93]
frame #18: _PyRun_SimpleFileObject + 0x1aa (0x6b4bfa in /usr/bin/python)
frame #19: _PyRun_AnyFileObject + 0x4f (0x6b4a2f in /usr/bin/python)
frame #20: Py_RunMain + 0x3b5 (0x6bca95 in /usr/bin/python)
frame #21: Py_BytesMain + 0x2d (0x6bc57d in /usr/bin/python)
frame #22: <unknown function> + 0x2a1ca (0x14c90b47d1ca in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: __libc_start_main + 0x8b (0x14c90b47d28b in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #24: _start + 0x25 (0x657ce5 in /usr/bin/python)

W0530 16:59:36.567000 2277043 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2277530 closing signal SIGTERM
W0530 16:59:36.567000 2277043 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2277531 closing signal SIGTERM

PyTorch versions within the container

pytorch-triton             3.2.0+git4b3bb1f8b.nvinternal
torch                      2.7.0a0+79aa17489c.nv25.4
torch-geometric            2.6.1
torch_tensorrt             2.7.0a0
torchprofile               0.0.4
torchvision                0.22.0a0

What we have found is that the job does not fail when using any version of the NGC PyTorch container older than 24.11 (e.g. 24.10).

We are still investigating why the change from PyTorch 2.5.0 to 2.6.0 causes this issue.

Do you see any issues in the NCCL_DEBUG=INFO output?

Thanks @ptrblck, here’s the only output we’re seeing with NCCL_DEBUG=INFO.

<hostname>:3484069:3484069 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:3484069:3484069 [0] NCCL INFO Bootstrap: Using eth0:10.24.2.33<0>
<hostname>:2737416:2737416 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:2737416:2737416 [0] NCCL INFO Bootstrap: Using eth0:10.24.2.40<0>
<hostname>:1979131:1979131 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:1979131:1979131 [0] NCCL INFO Bootstrap: Using eth0:10.24.2.123<0>
<hostname>:2811802:2811802 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:2811802:2811802 [0] NCCL INFO Bootstrap: Using eth0:10.24.0.88<0>
<hostname>:3171137:3171137 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:3171137:3171137 [0] NCCL INFO Bootstrap: Using eth0:10.24.2.5<0>
<hostname>:2724171:2724171 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:2724171:2724171 [0] NCCL INFO Bootstrap: Using eth0:10.24.2.160<0>
<hostname>:1994577:1994577 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:1994577:1994577 [0] NCCL INFO Bootstrap: Using eth0:10.24.3.32<0>
<hostname>:1279989:1279989 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
<hostname>:1279989:1279989 [0] NCCL INFO Bootstrap: Using eth0:10.24.3.75<0>

Thank you! I don’t see any raised warnings or errors, so I don’t know what’s causing the timeout.

Thanks @ptrblck! We were able to add some debug options for libuv. With the debug options enabled, we see this log line just prior to the errors:

[I604 04:33:38.563171637 TCPStoreLibUvBackend.cpp:136] [c10d - debug] Remote peer closed the connection.

Debug options

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_CPP_LOG_COMPONENTS=c10d,TCPStore,TCPStoreLibUvBackend,socket
export UV_DEBUG=1
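
For context, the logs point at torchrun’s c10d rendezvous (dynamic_rendezvous backed by a TCPStore on port 29502); a rough sketch of that kind of launch, where the GPU count, head-node endpoint, rendezvous id, and train.py are placeholders rather than our exact command:

# sketch only: placeholder values; flags are standard torchrun options
torchrun \
  --nnodes=256 \
  --nproc_per_node=$GPUS_PER_NODE \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$HEAD_NODE:29502 \
  --rdzv_id=$JOB_ID \
  train.py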

Snapshot of the log leading up to the error:

[I604 04:33:38.560878254 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:41176
[I604 04:33:38.562635932 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:56940
[I604 04:33:38.562835564 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:44632
[I604 04:33:38.562885367 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:33540
[I604 04:33:38.562907552 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:57970
[I604 04:33:38.562984441 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:36444
[I604 04:33:38.563171637 TCPStoreLibUvBackend.cpp:136] [c10d - debug] Remote peer closed the connection.
[I604 04:33:38.564688843 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:36444
[I604 04:33:38.564758771 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:33540
[I604 04:33:38.565029425 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:57970
[I604 04:33:38.565071486 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:44632
[I604 04:33:38.565432761 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:33664
[I604 04:33:38.566062087 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:47680
[I604 04:33:38.566496863 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:33540
[I604 04:33:38.566807991 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:44632
[I604 04:33:38.566902943 TCPStoreLibUvBackend.cpp:846] [c10d - trace] compareAndSet key:/torch.rendezvous.e0a58790-5e32-4503-b766-6c194d77b60b address:[hostname]:57970
[W604 04:33:38.449510883 TCPStore.cpp:115] [c10d] recvVector failed on SocketImpl(fd=4, addr=[hostname]:41176, remote=[head-node-hostname]:29502): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14e190a165e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5972b5e (0x14e1ecd7bb5e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5974130 (0x14e1ecd7d130 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x597487d (0x14e1ecd7d87d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5975509 (0x14e1ecd7e509 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x1fb (0x14e1ecd7852b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xc0d379 (0x14e1f50fd379 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x37e19d (0x14e1f486e19d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: /usr/bin/python() [0x58208f]
frame #9: _PyObject_MakeTpCall + 0x75 (0x549185 in /usr/bin/python)
frame #10: /usr/bin/python() [0x54cea7]
frame #11: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #12: /usr/bin/python() [0x54cd32]
[I604 04:33:38.567507446 TCPStoreLibUvBackend.cpp:136] [c10d - debug] Remote peer closed the connection.
frame #13: /usr/bin/python() [0x6f826c]
frame #14: /usr/bin/python() [0x6b917c]
frame #15: <unknown function> + 0x9caa4 (0x14e201271aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x129c3c (0x14e2012fec3c in /usr/lib/x86_64-linux-gnu/libc.so.6)
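
For reference, the compareAndSet trace entries above are the server-side view of the compare_set() calls the rendezvous issues against the TCPStore (the same call visible in the Python-side traceback). A standalone toy of that round trip, using a local host and an arbitrary port rather than our job’s store:

# spins up a local TCPStore (libuv backend by default in recent releases) plus a client
python - <<'PY'
from datetime import timedelta
from torch.distributed import TCPStore

server = TCPStore("127.0.0.1", 29599, is_master=True, timeout=timedelta(seconds=10))
client = TCPStore("127.0.0.1", 29599, is_master=False, timeout=timedelta(seconds=10))

# compare_set(key, expected, desired) writes desired only if the key currently holds
# expected (an empty expected inserts when the key does not exist yet) and returns
# the value now stored under the key
print(client.compare_set("/torch.rendezvous.toy", "", "state-v1"))
PY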

Oh, if this is related to TCPStore, we recently had a fix to the libuv backend. Can you try the PyTorch nightly build and see if it helps?
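
In case it helps anyone trying this, a typical way to pull a nightly wheel is via the nightly index; the CUDA suffix below is an assumption, so pick the channel that matches your stack:

# assumption: cu126 nightly channel; substitute the variant matching your CUDA version
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126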

Thanks @fduwjj, is this the fix you mentioned? https://github.com/pytorch/pytorch/pull/153977
Upgrading to nightly may have resolved a similar issue for me.