- This does not look correct to me. The world_size is the global size of the process group, which should be 4 in this case, but nprocs should be 2 on each node, as each node spawns two processes.
- The args tuple doesn’t seem to match the signature def example(rank, world_size). If you do this, it would pass (0, 0, 1) and (1, 0, 1) to the two processes on the first node, and (0, 2, 3) and (1, 2, 3) to the second node. But it should be passing (0, 4), (1, 4), (2, 4), and (3, 4) to example. Check out the multiprocessing.spawn() API doc.
To make it work on multiple nodes, you can modify the example signature to something like def example(local_rank, global_rank_offset, world_size), then use global_rank_offset + local_rank as the rank for init_process_group. Then you can spawn it like:
import torch.multiprocessing as mp

world_size = 4
global_rank_offset = 0  # this should be 2 for the other machine
mp.spawn(example,
         args=(global_rank_offset, world_size),
         nprocs=2,  # processes spawned on this node, not the global world_size
         join=True)
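To see which arguments each process ends up with, here is a small pure-Python sketch of the calling convention. It does not use PyTorch at all: simulate_spawn is a hypothetical stand-in that only mimics how spawn calls fn(i, *args) for each process index i, and it assumes two nodes with two processes each. The example function here just returns the rank it would hand to init_process_group.

```python
from typing import Callable, List, Tuple


def simulate_spawn(fn: Callable, args: Tuple, nprocs: int) -> List:
    """Hypothetical stand-in for mp.spawn: calls fn(i, *args) for i in range(nprocs).

    It runs sequentially in this process; the real API starts one process per index.
    """
    return [fn(i, *args) for i in range(nprocs)]


def example(local_rank: int, global_rank_offset: int, world_size: int) -> Tuple[int, int]:
    # The value that would be passed to init_process_group as the rank.
    global_rank = global_rank_offset + local_rank
    return global_rank, world_size


world_size = 4
procs_per_node = 2  # assumption: two nodes, two processes on each

# Node 0 uses offset 0, node 1 uses offset 2.
node0 = simulate_spawn(example, args=(0, world_size), nprocs=procs_per_node)
node1 = simulate_spawn(example, args=(2, world_size), nprocs=procs_per_node)

print(node0 + node1)  # [(0, 4), (1, 4), (2, 4), (3, 4)]
```

With nprocs set to the per-node process count, the two nodes together cover ranks 0 through 3 exactly once, each paired with the same world_size of 4.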
If you want to avoid manually configuring the rank, you can try the launch.py script. See this example.
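As a rough sketch of what the worker side can look like in that setup: this assumes the launcher exports RANK, LOCAL_RANK, and WORLD_SIZE as environment variables for each process it starts (check the launcher docs for your PyTorch version, since older versions pass the local rank as a command-line argument by default). RankInfo and rank_info_from_env are hypothetical helper names, not part of PyTorch.

```python
import os
from typing import Mapping, NamedTuple


class RankInfo(NamedTuple):
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node
    world_size: int  # total number of processes


def rank_info_from_env(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read the rank settings from environment variables.

    Assumes RANK, LOCAL_RANK, and WORLD_SIZE were set by the launcher.
    """
    return RankInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


# Simulated environment for the second process on the second node.
info = rank_info_from_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(info.rank, info.local_rank, info.world_size)  # 3 1 4
```

The worker can then pass info.rank and info.world_size to init_process_group instead of computing an offset by hand.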