Distributed training on multiple nodes

  1. This does not look correct to me. world_size is the global size of the process group, which should be 4 in this case, but nprocs should be 2 on each node, since each node spawns two processes.
  2. The args don’t match the signature def example(rank, world_size). If you do this, it would pass (0, 0, 1) and (1, 0, 1) to the two processes on the first node, and (0, 2, 3) and (1, 2, 3) to the second node. Instead, it should be passing (0, 4), (1, 4), (2, 4), and (3, 4) to example. Check out the multiprocessing.spawn() API doc and the sketch after this list.
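
To illustrate point 2, here is a minimal single-node sketch (the print body is just a stand-in) showing how multiprocessing.spawn() prepends the process index to args:

    import torch.multiprocessing as mp

    def example(rank, world_size):
        # mp.spawn() always prepends the process index, so with
        # args=(world_size,) each worker receives (rank, world_size)
        print(f"rank {rank} of {world_size}")

    if __name__ == "__main__":
        world_size = 4
        # single-node case: four processes, receiving (0, 4) ... (3, 4)
        mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)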

To make it work on multiple nodes, you can modify the example signature to something like def example(local_rank, global_rank_offset, world_size), then use global_rank_offset + local_rank as the rank for init_process_group. On each node you can then spawn it like:

    import torch.multiprocessing as mp

    world_size = 4            # total number of processes across both nodes
    nprocs_per_node = 2       # each node spawns two processes
    global_rank_offset = 0    # this should be 2 for the other machine
    mp.spawn(example,
        args=(global_rank_offset, world_size),
        nprocs=nprocs_per_node,
        join=True)
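
For reference, here is a minimal sketch of what the modified example could look like; the gloo backend and the MASTER_ADDR/MASTER_PORT environment variables are assumptions, not something from the original snippet:

    import os
    import torch.distributed as dist

    def example(local_rank, global_rank_offset, world_size):
        # assumes MASTER_ADDR / MASTER_PORT point at the rank-0 machine
        # and are exported on both nodes before spawning
        rank = global_rank_offset + local_rank   # global rank in [0, world_size)
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        # ... training / collective communication code goes here ...
        dist.destroy_process_group()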

If you want to avoid manually configuring the rank, you can try the launch.py script (torch.distributed.launch), which sets the rank-related environment variables for you. See this example.
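
As a rough sketch of that approach (train.py and the master address are placeholders, not from the original post), each node runs the command shown in the comment, and the training script reads the rank that launch.py exports:

    # node 0:  python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
    #              --nproc_per_node=2 --master_addr=<node0 IP> --master_port=29500 train.py
    # node 1 uses --node_rank=1 with the same other arguments.
    import os
    import torch.distributed as dist

    def main():
        # launch.py exports RANK and WORLD_SIZE for every process it starts
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        # ... training code ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()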