What is the correct world_size when training on multiple nodes?

devsy · August 31, 2023, 9:03pm

Hi,
I was running the DDP tutorial ("Getting Started with Distributed Data Parallel", PyTorch Tutorials 2.0.1+cu117 documentation) on two machines.

I used torchrun with the following command:
torchrun --nnodes=2 --nproc_per_node=$NUM_GPUS --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 ddp_test_elastic.py

When I printed dist.get_world_size(), it reported 4 on node 1 but 6 on node 2 (node_rank=0 has 4 GPUs and node_rank=1 has 2 GPUs in my setup).

Is this right?
I think it is correct for the second node, which prints 6, but maybe wrong for the first node.
If the first node is aware of the second node, then shouldn’t it be 6?
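
To make the check concrete, here is a minimal sketch of what could be printed in each worker. My understanding is that the world size is the total number of processes across all nodes, so I would expect 4 + 2 = 6 on every rank. The sketch assumes the environment variables that torchrun exports to each worker (RANK, LOCAL_RANK, GROUP_RANK, LOCAL_WORLD_SIZE, WORLD_SIZE). The helper names expected_world_size and describe_worker are made up for illustration, and the [4, 2] list is just my GPU layout.

```python
import os


def expected_world_size(procs_per_node):
    """Total number of worker processes across all nodes (hypothetical helper)."""
    return sum(procs_per_node)


def describe_worker(env=os.environ):
    """Collect the rank-related variables torchrun sets for this worker.

    Variables that are missing (e.g. when not launched via torchrun)
    are simply skipped.
    """
    keys = ("RANK", "LOCAL_RANK", "GROUP_RANK", "LOCAL_WORLD_SIZE", "WORLD_SIZE")
    return {key: int(env[key]) for key in keys if key in env}


if __name__ == "__main__":
    # Assumption: 4 processes on node_rank=0 and 2 on node_rank=1, as in my setup.
    print("expected world size:", expected_world_size([4, 2]))
    print("this worker sees:", describe_worker())
```

If WORLD_SIZE printed by a worker on the first node differs from the one on the second node, that would suggest the two nodes did not join the same rendezvous, rather than that each node has its own world size.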