Questions on underlying port restrictions in NCCL/Gloo communication

Hi,

I have a multi-node task running on a cluster, and the nodes often failed during operations like reduce (they hung there forever). I checked with the network team experts and they told me it's because NCCL/Gloo binds some extra sockets to port 0 (i.e., an arbitrary OS-assigned port) in addition to the specified MASTER_PORT, and that cluster only allows a certain port range, so when the arbitrary port falls outside that range the connection gets stuck.

Now, to fix this within the PyTorch framework: is there any way to specify these "under-the-hood" ports (in addition to MASTER_PORT) so that I can make sure all the connections stay within the allowed port range on the cluster?

Thanks!

I saw this init_method="tcp://10.0.0.1:8888" in some other posts (link). Will something like that work?

init_method would probably work, but can I understand the question better? Why would specifying MASTER_PORT not work? I think the two serve similar purposes.
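
For context, here is a minimal sketch of the two approaches, assuming a two-rank job launched with torchrun; the address 10.0.0.1, port 8888, and the rank/world_size fallbacks are placeholders:

import os
import torch.distributed as dist

# Option A: env:// rendezvous (the default). torchrun / your launcher sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, and PyTorch reads them:
#   dist.init_process_group(backend="nccl", init_method="env://")

# Option B: the same rendezvous information passed inline via init_method.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.0.0.1:8888",           # placeholder address:port
    rank=int(os.environ.get("RANK", 0)),          # placeholder fallback
    world_size=int(os.environ.get("WORLD_SIZE", 2)),
)

Either way, only the rendezvous (store) port is controlled here, which is exactly why the question above is about the other ports.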

I did use MASTER_PORT/MASTER_ADDR to control the rendezvous port. What I heard from the network team experts is that the distributed backend opens other communication channels that use port 0 (meaning: an arbitrary port), which get blocked randomly. My question is how to control that part.

Gloo binds one port per rank to perform p2p communication; it's not easy to override since you can have multiple ranks per host.

Unfortunately, we don’t expose in PyTorch any configuration of Gloo’s port binding behavior.

This seems like a problem that could be solved by better configuring your cluster. Is that an option?

The network team is working on that, but probably not in the near future. Does what you said apply to the NCCL backend as well? Thanks!

NCCL performs p2p communication as well, over either sockets or InfiniBand. @kwen2501 should be able to better answer you about it.

Hi there,
MASTER_PORT is for PyTorch only, it does not apply to NCCL.
To force NCCL to use a specific port on the master rank, you can use NCCL’s environment variable:
NCCL_COMM_ID=<ip>:<port>
For example, NCCL_COMM_ID=192.168.0.1:23456

I should note that fixing the port via NCCL_COMM_ID has its own side effect:
if you use the same port for two consecutive jobs, it can happen that the OS has not yet released the port after the first job exits, so the second job may complain "Port already in use."
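
For reference, a minimal sketch of setting it from Python before the process group is created; the IP 192.168.0.1 and port 23456 are placeholders, and torchrun is assumed to have already set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE:

import os
import torch.distributed as dist

# Placeholder values: the master (rank 0) node's IP plus a port that the
# cluster's firewall allows.
os.environ["NCCL_COMM_ID"] = "192.168.0.1:23456"

# NCCL reads the variable from the environment when its communicator is
# created, so it must be set before the first NCCL call; exporting it in the
# shell before launching (NCCL_COMM_ID=... torchrun ...) works equally well.
dist.init_process_group(backend="nccl")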

Thank you for the super helpful info! I have a few questions about the env setting NCCL_COMM_ID=<ip>:<port>:

  1. Is the <ip> still the IP of the master (rank 0) machine? (It looks so based on the context; asking to confirm.)
  2. Does the <port> need to be
    a. the same as the PyTorch MASTER_PORT?
    b. an arbitrary one but different from MASTER_PORT? Or
    c. a completely arbitrary one? (modulo the port range constraints required by the network setup, which is why this post/question was brought up in the first place)
  3. I assume the <ip>:<port> still needs to be consistent across all nodes. Is that correct? (Still a fixed choice from the master node.)

All good questions :)

  1. Yes, usually that’s the case.

  2. Please pick a different port from MASTER_PORT; otherwise there would be a collision. And yes, it must be a usable port in your network setup.

  3. Yes, the same NCCL_COMM_ID must be set on all nodes.


Thank you very much!

Hi Ke, I’m still running into issues. I tested it with two nodes (dedicated machines that do not have that port restriction), and I kept seeing something like the following:

... in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: <some hpp file> unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.

My command is like the following:

NCCL_COMM_ID=<master_addr>:12345 torchrun --node_rank=<0|1> --nnodes=2 --master_addr=<master_addr> --master_port=23456 --nproc_per_node=gpu <some_python_script.py> <script_args...>

If I remove the leading NCCL_COMM_ID env var, it works fine (though then it won't work on the cluster with the port restrictions).

I also tried NCCL_DEBUG=INFO. It says "No interface found in the same subnet as remote address <master_addr>:<12345>" and "No usable listening interface found".

Am I missing something?

Can you log into those nodes and verify that you can bind to the specified port and connect from the other node to it?

It might be that there’s something in the network configuration that you’re not accounting for.
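
For instance, a plain TCP check with no NCCL involved would rule the network in or out; the script name port_check.py, the port 12345, and the addresses are all hypothetical placeholders:

import socket
import sys

# Usage (hypothetical):
#   on the master node:   python port_check.py listen 0.0.0.0 12345
#   on the other node:    python port_check.py connect <master_addr> 12345
mode, host, port = sys.argv[1], sys.argv[2], int(sys.argv[3])

if mode == "listen":
    # Bind to the port under test and accept a single connection.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, peer = srv.accept()
    print("accepted connection from", peer)
    conn.close()
    srv.close()
else:
    # Try to reach the listener from the other node.
    cli = socket.create_connection((host, port), timeout=10)
    print("connected to", (host, port))
    cli.close()

If this simple check fails, the problem is the network path or firewall rather than anything NCCL-specific.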

I think the port should be OK. I tried using the same port in --master_port and it worked fine. Only when I use it in NCCL_COMM_ID (and change --master_port to a different working one) does it stop communicating.

I think I have a bit of a grasp of what's going on here. I took a deeper look at the NCCL INFO logs and found the following:

Suppose I have two nodes (I use variable names to denote all IP/port numbers for convenience):
rank0 (master), with ip0
rank1, with ip1
Now, I start the process on both nodes with --master_port=mpt (e.g. mpt=31112) and NCCL_COMM_ID=ip0:npt (where npt is a different port number from mpt, say npt=32111). After the processes start, I see something like the following in the console:

rank0 info log:

NCCL INFO Bootstrap : Using eth0:ip0<0>

(note that the port is 0 in the <>-enclosed part)

rank1 info log:

misc/socket.cc:191 NCCL WARN Net : No interface found in the same subnet as remote address ip0:<npt>

(note that the port is the selected NCCL port npt, not 0).

So somehow, on the rank0 node, although I have specified the NCCL port npt, the process still seems to be using port 0 for the NCCL communication. I think this observation is related to the issue. What do you folks think?

I also found something else interesting: if I just set NCCL_COMM_ID=<host's own ip>:<some port>, the process works OK (communication works), and the info log just says

NCCL NET/Socket : Using [0]eth0:<host's own ip><0>
...
NCCL INFO Bootstrap : Using eth0:<host's own ip><0>

This is the same as the case where I don't specify the NCCL_COMM_ID env var. Overall, I didn't see NCCL_COMM_ID's port number appear anywhere in the info log (using NCCL_DEBUG=INFO), except when node 1 fails, complaining that it cannot find an interface in the same subnet as rank0_ip:<nccl_port>.

The issue may be that host 1 is not in the same subnet as the IP address specified in NCCL_COMM_ID. It is not about the port specified in NCCL_COMM_ID.

For example, if host 0 is in subnet 192.168.1.xx, and host 1 is in subnet 10.0.1.xx, this won’t work.

I believe they are in the same subnet. The IPv4 addresses of the two test nodes differ only in the last 8 bits, i.e. the two IPs are a1.a2.a3.b1 and a1.a2.a3.b2, respectively. That suggests they are in the same subnet.

Maybe you need to check the netmask using ifconfig; NCCL uses the netmask to check whether two IPs are in the same subnet.
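
As a quick sanity check, the same kind of subnet test can be reproduced with Python's ipaddress module; the addresses and the /24 prefix below are placeholders, so substitute the interface address and the prefix length implied by the netmask ifconfig reports (255.255.255.0 == /24):

import ipaddress

# Local interface address with its prefix length (placeholder values).
host0 = ipaddress.ip_interface("192.168.1.10/24")
# Remote address to test against (placeholder value).
host1 = ipaddress.ip_address("192.168.1.20")

# Two hosts are in the same subnet if the remote address falls inside the
# local interface's network. If this prints False, that matches the
# "No interface found in the same subnet" warning above.
print(host1 in host0.network)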