I have a multi-node task running on a cluster, and the nodes often fail on collective operations like reduce (they hang forever). I checked with the network team experts and they told me it's because NCCL/Gloo binds some extra sockets (in addition to the one on the specified MASTER_PORT) to port 0, i.e. an arbitrary OS-assigned port. There is an allowed port range on that cluster, so when that arbitrary port falls outside the range, the connection gets stuck.
Now, to fix this within the PyTorch framework: is there any way to specify this "under-the-hood" port (in addition to MASTER_PORT) so that I can make sure all port connections stay within the allowed range on the cluster?
I did use MASTER_PORT/MASTER_ADDR to control the rendezvous port. What I heard from the network team experts is that the distributed backend opens other communication channels that use port 0 (meaning: an arbitrary port), which get blocked seemingly at random. My question is how to control that part.
Hi there, MASTER_PORT is for PyTorch's own rendezvous only; it does not apply to NCCL.
To force NCCL to use a specific port on the master rank, you can use NCCL’s environment variable: NCCL_COMM_ID=<ip>:<port>
For example, NCCL_COMM_ID=192.168.0.1:23456
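As a sketch (the IP and port values below are placeholders, not from this thread), the variable can be set from Python before the process group is initialized; every rank must export the same value:

```python
import os

# Placeholder values: substitute your master node's real IP and ports that
# fall inside the cluster's allowed range.
MASTER_IP = "192.168.0.1"

os.environ["MASTER_ADDR"] = MASTER_IP
os.environ["MASTER_PORT"] = "23456"                 # PyTorch rendezvous port
os.environ["NCCL_COMM_ID"] = f"{MASTER_IP}:23457"   # NCCL bootstrap endpoint

# torch.distributed.init_process_group("nccl", ...) would be called after this,
# so NCCL picks the variable up when the communicator is created.
```

Equivalently, you can `export` the same variables in the launch script; the key point is that the value is identical on all ranks.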
I should note that fixing the port via NCCL_COMM_ID has its own side effect:
if you use the same port for two consecutive jobs, the OS may not have released the port (e.g. it can still be in TIME_WAIT) after the first job exits, so when the second job starts it may fail with "Port already in use."
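If you run into that, one workaround is to probe for a currently-bindable port inside the allowed range before launching. A minimal sketch (the range bounds here are made up; use your cluster's allowed range):

```python
import socket

def find_free_port(lo: int = 23456, hi: int = 23556, host: str = "") -> int:
    """Return the first TCP port in [lo, hi] that can currently be bound.

    Caveats: another process could still grab the port between this check
    and the actual launch (a race), and a port left in TIME_WAIT by a
    previous job will correctly show up as unavailable here.
    """
    for port in range(lo, hi + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return port  # bind succeeded, so the port is free right now
            except OSError:
                continue     # in use or not yet released; try the next one
    raise RuntimeError(f"no free TCP port found in [{lo}, {hi}]")
```

The returned port can then be fed into NCCL_COMM_ID (and a second one into MASTER_PORT) when composing the job's environment.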
Thank you for the super helpful info! I have a few questions about the env setting NCCL_COMM_ID=<ip>:<port> :
is the <ip> still the IP of the master (rank 0) machine? (It looks so based on the context, asking to confirm.)
does the <port> need to be
a. the same as the pytorch MASTER_PORT ?
b. an arbitrary one but different from the MASTER_PORT ? Or
c. a completely arbitrary one? (modulo the port-range constraints imposed by the network setup, which is why this question came up in the first place)
I assume the <ip>:<port> still needs to be consistent across all nodes, i.e. every rank sets the same fixed value pointing at the master node. Is that correct?
Hi Ke, I'm still running into issues. I tested it with two nodes (dedicated machines that do not have that port restriction), and I kept seeing something like the following:
... in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: <some hpp file> unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can also be caused by the unexpected exit of a remote peer; you can check NCCL warnings for the failure reason and see if there is a connection closure by a peer.
I think the port itself should be OK: if I use the same port in --master_port, everything works fine. Only when I use it in NCCL_COMM_ID (and change --master_port to a different, known-working port) does communication stop.
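For what it's worth, a quick way to sanity-check that a given host:port is reachable at the TCP level (the helper name is mine, not part of PyTorch or NCCL):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a plain TCP connect. True means something is listening and
    the path is not blocked. False can mean either 'blocked by the network'
    or simply 'nothing listening yet', so interpret it with care."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from the worker node against the master's NCCL_COMM_ID endpoint (while the master-side process is up) helps separate "port blocked" from "NCCL not using that port at all".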
I think I now have a better grasp of what's going on. I took a deeper look at the NCCL INFO logs (NCCL_DEBUG=INFO) and found the following.
Suppose I have two nodes (I use variable names for all IPs and port numbers for convenience):
rank0 (master), with IP ip0
rank1, with IP ip1
Now I start the process on both nodes with --master_port=mpt (e.g. mpt=31112) and NCCL_COMM_ID=ip0:npt (where npt is a port number different from mpt, say npt=32111). After the processes start, I see the following in the console:
rank0 info log:
NCCL INFO Bootstrap : Using eth0:ip0<0>
(note that the port in the <> enclosed area is 0)
rank1 info log:
misc/socket.cc:191 NCCL WARN Net : No interface found in the same subnet as remote address ip0:<npt>
(note that the port here is the selected NCCL port npt, not 0).
So somehow, on the rank0 node, although I specified the NCCL port npt, the process still seems to be using port 0 for the NCCL bootstrap. I think this observation is related to the issue. What do you folks think?
I also found something else interesting: if I just set NCCL_COMM_ID=<host's own ip>:<some port>, the process works fine (communication succeeds), and the info log just says
NCCL NET/Socket : Using eth0:<host's own ip><0>
NCCL INFO Bootstrap : Using eth0:<host's own ip><0>
Same as the case where I don't specify the NCCL_COMM_ID env var at all. Overall, I didn't see NCCL_COMM_ID's port number appear anywhere in the info log (using NCCL_DEBUG=INFO), except when node1 fails, complaining that it cannot find an interface in the same subnet as rank0_ip<nccl_port>.
I believe they are in the same subnet. The IPv4 addresses of the two test nodes differ only in the last 8 bits, i.e. the two IPs are a1.a2.a3.b1 and a1.a2.a3.b2 respectively, which, assuming a typical /24 (255.255.255.0) netmask, puts them in the same subnet.
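To double-check that claim programmatically (the addresses below are hypothetical stand-ins for a1.a2.a3.b1 and a1.a2.a3.b2, and the /24 netmask is an assumption — the real one comes from `ip addr` on the node), Python's standard ipaddress module can verify it:

```python
import ipaddress

# Hypothetical addresses standing in for the two test nodes.
ip_rank0 = "10.1.2.11"
ip_rank1 = "10.1.2.12"

# Assumed /24 netmask; replace with the interface's actual prefix length.
net0 = ipaddress.ip_interface(f"{ip_rank0}/24").network
net1 = ipaddress.ip_interface(f"{ip_rank1}/24").network

print(net0 == net1)  # prints True: same /24 network under this netmask
```

If the interface actually uses a narrower prefix (e.g. /25), the same check with that prefix length could come out False, which would explain the "No interface found in the same subnet" warning.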