How to correctly launch multi-node training

dist.get_rank() only ever returns 0 or 1, even though I launch the training with:

    python -m torch.distributed.launch \
        --nproc_per_node=4 \
        --nnodes=2 \
        --node_rank=0 \
        --master_addr="$MASTER_ADDR" \
        --master_port="$MASTER_PORT" \
        train_dist.py &

I am using 2 nodes each with 4 GPUs.

Hey @ankahira

Two questions:

  1. What parameters did you pass to the init_process_group invocation in train_dist.py?
  2. Can you check if RANK and WORLD_SIZE are set properly for each process?

Sudarshan wrote a great example of how to use launcher.py, which might be helpful to you.
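
If it helps, here's a quick way to check (a minimal sketch; drop it near the top of train_dist.py):

    import os

    # torch.distributed.launch exports these for every process it spawns;
    # if any of them look wrong, init_process_group will misbehave.
    print("RANK =", os.environ.get("RANK"),
          "| WORLD_SIZE =", os.environ.get("WORLD_SIZE"),
          "| MASTER_ADDR =", os.environ.get("MASTER_ADDR"),
          "| MASTER_PORT =", os.environ.get("MASTER_PORT"))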

Actually, I am a bit confused about this. I understand that I should set WORLD_SIZE to the number of nodes, i.e. 2. I am not sure what I should set as RANK, but I set it to 0 and 1 for the two nodes. For init_process_group I pass each of the GPUs, as in 0, 1, 2, 3. Something like this:

    import torch.distributed as dist
    from torch.multiprocessing import Process

    def init_processes(rank, size, fn, backend='nccl'):
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 4
        processes = []
        for rank in range(size):
            # run is the training function defined elsewhere
            p = Process(target=init_processes, args=(rank, size, run))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

If you have 8 processes (4 processes per node with 2 nodes), world_size should be 8 for init_process_group. But you don’t need to set that yourself, as the launcher script will set the env vars for you properly.
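
In other words, with the launcher you should not spawn the worker processes yourself; the launcher already starts 4 copies of train_dist.py per node, and each copy only needs to join the group. A minimal sketch of a launcher-compatible train_dist.py (the training loop is elided):

    import torch.distributed as dist

    if __name__ == "__main__":
        # RANK and WORLD_SIZE come from the env vars the launcher set,
        # so neither needs to be passed explicitly.
        dist.init_process_group(backend='nccl')
        print(f"global rank {dist.get_rank()} of {dist.get_world_size()}")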

I am not sure what I should set as RANK but I set it as 0 and 1 for the two nodes

There are two ranks here:

  • node rank: this is what you provide for --node_rank to the launcher script, and it is correct to set it to 0 and 1 for the two nodes.
  • process rank: this rank should be --node_rank × --nproc_per_node + local GPU id, i.e. 0–3 for the four processes on the first node and 4–7 for the four processes on the second node (see the sketch after this list). But you don’t need to set this for init_process_group either, as the launcher script will have set the env var for you.
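
To make the arithmetic concrete (illustrative only, not the launcher's actual code):

    nproc_per_node = 4
    for node_rank in (0, 1):
        for local_rank in range(nproc_per_node):
            # global rank = node_rank * nproc_per_node + local GPU id
            print(f"node {node_rank}, GPU {local_rank} -> rank",
                  node_rank * nproc_per_node + local_rank)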

With the params you provided to the launcher script, the following should be sufficient to init the process group.

    dist.init_process_group(backend)

If you also need the local rank for DDP, you will need to parse it from the args and then pass it to the DDP constructor. Something like:

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank)
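
For the parsing step, something like this should work (a sketch assuming the default launcher behavior, where --local_rank is passed to each worker as a command-line argument rather than via --use_env):

    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Pin this process to its own GPU before constructing DDP.
    torch.cuda.set_device(args.local_rank)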

Check out the readme in the launcher script.

Thank you very much. That is very clear now.

@ankahira Can you please share your train_dist.py? I still have a problem using the launcher!