Trouble launching on multiple nodes

I’m trying to set up PyTorch with Slurm and NCCL.
I have 2 nodes, each with one GPU.
There are both an ethernet and an InfiniBand connection between the two nodes.
There is also a separate ethernet connection on the master node with its public address.
This is the file I’m using to launch a job.

run.sh

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --time=1-00:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=elastic_ddp_test

head_node_ip=192.168.0.1
head_node_port=29500

echo Node IP: $head_node_ip
export LOGLEVEL=DEBUG

elastic_ddp_test="
–nnodes=2
–nproc_per_node=1
–rdzv_id=100
–rdzv_backend=c10d
–rdzv_endpoint=$head_node_ip:$head_node_port
–master_addr $head_node_ip
–master_port $head_node_port
elastic_ddp_test.py"

torchrun $elastic_ddp_test

I’m launching it with ‘sbatch run.sh’.
The address of the head node that the second node can reach is 192.168.0.1. The second node does not have public internet access; its only network interfaces are the ethernet and InfiniBand connections to the head node.
Running this fails to create the c10d store.
If I change head_node_ip to localhost, it creates the store, but then gets stuck on “Rendezvous’ing worker group”.
If I change head_node_ip to localhost and only run it on the head node, then it successfully runs the job.
Using localhost also uses the public interface, which the secondary node cannot connect to.
Part of the issue seems to be that torchrun only creates its store on IPv6.
If I run ‘netstat -ntlp’ while the job is running, python is listening only on IPv6.
The secondary node does have an IPv6 address, but if the master node’s IP is IPv4, the secondary node will try to connect over IPv4, which fails.
If I try to use the IPv6 address of the head node instead, it fails with ‘the IPv4 network addresses cannot be retrieved’.
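
As a quick sanity check of what the connecting side will actually try, a minimal snippet along these lines (using the same head-node address as in run.sh) prints the address families the endpoint resolves to; the connecting side tries the resolved addresses in order:

import socket

# Resolve the rendezvous endpoint the way a connecting client would;
# each (family, sockaddr) pair is a candidate address the client may try.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "192.168.0.1", 29500, proto=socket.IPPROTO_TCP):
    print(socket.AddressFamily(family).name, sockaddr)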
Is there a way to force torchrun to open a c10d store on a specific interface or address? I believe that would solve my issue.
I have confirmed that I can open tcp port 29500 on the head node and connect to it from the second node using netcat.

You can configure this with --rdzv_endpoint. I would also first suggest that you write a simple script that creates a store, and have both nodes run it to make sure they can connect without torchrun.
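
As a minimal sketch of such a test (assuming the 192.168.0.1:29500 endpoint from your script; IS_SERVER is just a variable made up for this example), run it with IS_SERVER=1 on the head node first, then with IS_SERVER=0 on the second node:

import os
from datetime import timedelta

import torch.distributed as dist

# The head node hosts the store; the second node connects to it.
is_server = os.environ.get("IS_SERVER", "0") == "1"

# With world_size=2 the server side waits until the client has connected.
store = dist.TCPStore(
    "192.168.0.1",   # head-node address from the question
    29500,
    world_size=2,
    is_master=is_server,
    timeout=timedelta(seconds=60),
)

store.set("ping", "pong")
print("connected:", store.get("ping"))

If both sides get past the constructor and print the key, the nodes can reach each other on that address and port, and the problem likely lies in how torchrun binds the store rather than in the network.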

Hi, if I launch a job without specifying --rdzv_backend, which backend is used by default?
I am running a training with this launcher on multiple nodes

export LAUNCHER="torchrun \
     --nproc_per_node $GPUS_PER_NODE \
     --nnodes $NNODES \
     --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
     --max_restarts 0 \
     --tee 3 \
     "

But it fails with error:

torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (MASTER ADDR, Port)

If I remove --rdzv_backend c10d, the training runs successfully (also note that the nodes don’t have access to the internet). Is there a reason this flag causes the failure, and will removing it impact my training in any way?