I’m trying to set up PyTorch with Slurm and NCCL.
I have 2 nodes, each with one GPU.
There are Ethernet and InfiniBand connections between the two nodes.
The master node also has a separate Ethernet interface with a public address.
This is the file I’m using to launch a job.
echo Node IP: $head_node_ip
I’m launching it with ‘sbatch run.sh’.
The address of the head node that the second node can access is 192.168.0.1. The second node does not have public internet access. Its only network interfaces are the Ethernet and InfiniBand connections to the head node.
Running this fails to create the c10d store.
If I change head_node_ip to localhost, it creates the store, but then gets stuck on ‘Rendezvous’ing worker group’.
If I change head_node_ip to localhost and only run it on the head node, then it successfully runs the job.
Using localhost also uses the public interface, which the secondary node cannot connect to.
Part of this issue seems to be that torchrun only creates the store on IPv6.
If I run ‘netstat -ntlp’ while the job is running, python is only listening on IPv6.
The secondary node does have an IPv6 address, but if the head node IP is given as IPv4, the secondary node tries to connect over IPv4, which fails.
If I try to use the IPv6 address of the head node instead, it fails with ‘the ipv4 network addresses cannot be retrieved’.
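Note that on Linux a listener that netstat shows only as tcp6 is not necessarily IPv6-only: a socket bound to the IPv6 wildcard (::) with IPV6_V6ONLY turned off also accepts IPv4 clients. The sketch below illustrates this with plain sockets on loopback. It is not torchrun's code, the function name is made up for illustration, and whether the c10d store behaves this way on a given machine is an assumption that would need checking.

```python
import socket


def accepts_ipv4_via_ipv6_listener(v6only):
    """Bind a listener to the IPv6 wildcard (::) and report whether an
    IPv4 client on loopback can connect to it.

    Hypothetical helper for illustration; raises OSError if IPv6 is
    unavailable on this machine.
    """
    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        # IPV6_V6ONLY=0 makes the socket dual-stack: IPv4 clients are
        # accepted and appear as IPv4-mapped addresses (::ffff:a.b.c.d).
        srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, int(v6only))
        srv.bind(("::", 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        try:
            # Connect as an IPv4 client to the same port.
            with socket.create_connection(("127.0.0.1", port), timeout=2.0):
                return True
        except OSError:
            return False
    finally:
        srv.close()
```

If accepts_ipv4_via_ipv6_listener(False) returns True on the head node, a tcp6-only line in the netstat output does not by itself explain why IPv4 connections fail.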
Is there a way to force torchrun to open a c10d store on a specific interface or address? I believe that would solve my issue.
I have confirmed that I can open TCP port 29500 on the head node and connect to it from the second node using netcat.
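The same reachability check can be repeated per address family from the second node, which helps separate "port unreachable" from "wrong IP version". This is a minimal sketch using only the standard library; the helper name probe is made up for illustration, and the address and port are the ones from this post.

```python
import socket


def probe(host, port, family, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds using
    the given address family (socket.AF_INET or socket.AF_INET6)."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # host has no address in this family
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False
```

For example, probe("192.168.0.1", 29500, socket.AF_INET) run on the second node while the job is starting shows whether the store port is reachable over IPv4. Passing socket.AF_INET6 together with the head node's IPv6 address checks the other family.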