From the document (Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation) we can see there are two approaches we can use to set up distributed training.
The first approach is to use
torch.multiprocessing.spawn() (mp.spawn) within one Python file.
The second approach is to use the launch utility (torch.distributed.launch / torchrun) to start one process per worker.
I observed that there is a slight difference in how the process group is initialized.
For the first approach, it is
dist.init_process_group("nccl", rank=rank, world_size=2)
For the second approach, it is
dist.init_process_group("nccl")
As we can see, the first approach specifies the rank and world size explicitly while the second does not, so what leads to such a difference?
I also tried to use
mp.spawn to start distributed training without specifying the rank and world size, and it yielded an error:
RuntimeError: Address already in use. So what leads to this problem?