Actually I am a bit confused about this. I understand that I should set WORLD_SIZE as the number of nodes, i.e. 2. I am not sure what I should set as RANK, but I set it as 0 and 1 for the two nodes. For `init_process_group` I pass each of the GPUs, as in 0, 1, 2, 3. Something like this.
If you have 8 processes (4 processes per node with 2 nodes), world_size should be 8 for init_process_group. But you don’t need to set it explicitly, as the launcher script will set the env vars for you properly.
I am not sure what I should set as RANK but I set it as 0 and 1 for the two nodes
There are two ranks here:
node rank: this is what you provide for --node_rank to the launcher script, and it is correct to set it to 0 and 1 for the two nodes.
process rank: this rank should be --node_rank * --nproc_per_node + local GPU id, which gives 0–3 for the four processes on the first node and 4–7 for the four processes on the second node. But you don’t need to pass this to init_process_group either, as the launcher script should have set the env var for you.
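The rank arithmetic above can be sketched in plain Python (the numbers assume the 2-node, 4-GPU-per-node setup discussed here):

```python
nproc_per_node = 4    # processes (GPUs) per node
node_ranks = [0, 1]   # value passed as --node_rank on each node

# global process rank = node_rank * nproc_per_node + local GPU id
global_ranks = [
    node_rank * nproc_per_node + local_rank
    for node_rank in node_ranks
    for local_rank in range(nproc_per_node)
]
print(global_ranks)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

So the first node's processes get global ranks 0–3 and the second node's get 4–7, exactly as described above.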
With the params you provided to the launcher script, the following should be sufficient to init the process group.
dist.init_process_group(backend)
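This works because the launcher communicates everything through environment variables, which init_process_group reads with its default env:// init method. A rough simulation of what the launcher would export for, say, the process with global rank 5 (node 1, local GPU 1); the address and port values here are placeholders:

```python
import os

# Roughly what torch.distributed.launch exports for the worker with
# global rank 5 in a 2-node x 4-GPU-per-node job.
# MASTER_ADDR / MASTER_PORT values below are placeholders.
os.environ["MASTER_ADDR"] = "10.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = "8"
os.environ["RANK"] = "5"        # node_rank 1 * nproc_per_node 4 + local rank 1
os.environ["LOCAL_RANK"] = "1"

# dist.init_process_group(backend) would then pick up RANK and
# WORLD_SIZE from the environment; no explicit args needed.
print(os.environ["RANK"], os.environ["WORLD_SIZE"])  # 5 8
```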
If you also need the local rank for DDP, you will need to parse it from the command-line args and then pass it to the DDP constructor. Something like:
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
)
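The launcher passes --local_rank to every process it spawns, so the parsing step is a small argparse snippet. A minimal sketch (the CLI input is simulated here; in a real script you would call parse_args() with no arguments and then hand args.local_rank to the DDP constructor as above):

```python
import argparse

# torch.distributed.launch invokes each worker roughly as:
#   python train.py --local_rank=<gpu id on this node>
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args(["--local_rank", "2"])  # simulated CLI input
print(args.local_rank)  # 2
```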