GLOO/NCCL connection issues [build from source]

Thanks! I built from HEAD to include the PR and got:

(31286) ~ $ export TORCH_DISTRIBUTED_DETAIL=DEBUG
(31286) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(31286) ~ $ echo $PORT
32967
(31286) ~ $ export BACKEND=gloo
(31286) ~ $ srun python ddp_torch.py
[W socket.cpp:634] The server socket on [localhost]:32967 is not yet listening (generic error: 111 - Connection refused).
terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer
Using backend: gloo
my rank = 1  my size = 2

It worked only partially: rank 1 reported in, but rank 0 is missing from the output.
With the NCCL backend, nothing is printed except the error.
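For readability, the one-liner I used to set PORT can be written out as a small helper. This is only a sketch of that same probe; the function name `find_free_port` is mine and not part of PyTorch. The port is free when it is probed, but nothing reserves it afterwards, and every rank has to see the same value, which is why I export it once before calling srun.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS pick any free port.
        s.bind(("", 0))
        # getsockname() returns (host, port) for an IPv4 socket.
        return s.getsockname()[1]


if __name__ == "__main__":
    # Same effect as the inline `python -c` command in the session above.
    print(find_free_port())
```

Saved as a script, this could be used as `export PORT=$(python find_free_port.py)`, where the file name is again just an example.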

I’ll create an issue then.
Issue: GLOO/NCCL connection issues [build from source] (pytorch/pytorch, Issue #69003 on GitHub)