`dist.get_rank()` returns either 0 or 1 even though I launch the training with

```shell
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    train_dist.py &
```

I am using 2 nodes, each with 4 GPUs.
- What parameters did you pass to the `init_process_group` invocation in `train_dist.py`?
- Can you check if the `RANK` and `WORLD_SIZE` environment variables are set properly for each process?
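One quick way to check (a sketch; drop it at the top of `train_dist.py`, before `init_process_group`) is to print the variables the launcher is supposed to export for every worker:

```python
import os

# Env vars torch.distributed.launch exports for each worker process.
# Note: LOCAL_RANK is only exported with --use_env (or on newer torch
# versions where the launcher wraps torchrun).
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

If any of them prints `<unset>`, the script was not started through the launcher.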
Sudarshan wrote a great example of how to use launcher.py, which might be helpful to you.
Actually I am a bit confused about this. I understand that I should set `WORLD_SIZE` to the number of nodes, i.e. 2. I am not sure what I should set as `RANK`, but I set it as 0 and 1 for the two nodes. For `init_process_group` I pass each of the GPUs, as in 0, 1, 2, 3. Something like this:
```python
import torch.distributed as dist
from multiprocessing import Process

def init_processes(rank, size, fn, backend='nccl'):
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```
If you have 8 processes (4 processes per node with 2 nodes), world_size should be 8 for `init_process_group`. But you don’t need to set that, as the launcher script will set the env vars for you properly.
> I am not sure what I should set as `RANK`, but I set it as 0 and 1 for the two nodes
There are two ranks here:

- node rank: this is what you provide for `--node_rank` to the launcher script, and it is correct to set it to 0 and 1 for the two nodes.
- process rank: this rank should be `node_rank * nproc_per_node + local GPU id`, which should be 0~3 for the four processes on the first node, and 4~7 for the four processes on the second node. But you don’t need to set this for `init_process_group` either, as the launcher script should have set the env var for you.
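To make the arithmetic concrete, here is a small sketch (plain Python, no torch needed) that prints the process rank each of the 8 workers gets with your launcher settings:

```python
# Process (global) rank = node_rank * nproc_per_node + local GPU id.
# With --nnodes=2 and --nproc_per_node=4:
nproc_per_node = 4
for node_rank in range(2):
    for local_rank in range(nproc_per_node):
        global_rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, gpu {local_rank} -> rank {global_rank}")
```

Seeing only ranks 0 and 1 therefore suggests each node is creating its own 2-process group instead of one 8-process group.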
With the params you provided to the launcher script, the following should be sufficient to init the process group.
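A minimal sketch of that call, assuming the launcher has exported `RANK` and `WORLD_SIZE` (this fragment only runs inside a launched worker process):

```python
import torch.distributed as dist

# Reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the
# environment that torch.distributed.launch sets up for each worker.
dist.init_process_group(backend='nccl', init_method='env://')
```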
If you also need the local rank for DDP, you will need to parse it from the args, and then pass it to the DDP constructor. Something like:

```python
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank)
```
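For completeness, a sketch of how the local rank is usually picked up: `torch.distributed.launch` invokes each worker with a `--local_rank` argument (the model and DDP code around it are assumed):

```python
import argparse

# torch.distributed.launch starts each worker as:
#   train_dist.py --local_rank=<gpu id on this node>
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()
print("local rank:", args.local_rank)
```

`args.local_rank` is then what goes into `device_ids=[...]` in the DDP constructor above.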
Check out the readme in the launcher script.
Thank you very much. That is very clear now.
@ankahira Can you please share your `train_dist.py`? I still have problems with using the launcher!