From the documentation (Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation) we can see that there are two approaches to setting up distributed training. The first approach is to use multiprocessing.spawn() within one Python file. The second approach is to launch the script with torchrun or torch.distributed.launch.
I noticed that the process group is initialized slightly differently in each case.
For the first approach, it is
dist.init_process_group("nccl", rank=rank, world_size=2)
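For reference, here is a minimal sketch of how I use the first approach (the training code is omitted, and the master address and port are placeholder values):

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process knows its own rank and the total world size,
    # so both can be passed to init_process_group explicitly.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... training code ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # mp.spawn() starts world_size processes and passes the process
    # index as the first argument (the rank) to worker().
    mp.spawn(worker, args=(world_size,), nprocs=world_size)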
For the second approach, it is
torch.distributed.init_process_group(backend='nccl', init_method='env://')
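And here is a minimal sketch of the second approach, assuming the script is named train.py (the name is just an example):

import torch.distributed as dist

# torchrun / torch.distributed.launch set RANK, WORLD_SIZE, MASTER_ADDR
# and MASTER_PORT as environment variables, and init_method='env://'
# tells init_process_group to read them from the environment.
dist.init_process_group(backend='nccl', init_method='env://')
rank = dist.get_rank()
world_size = dist.get_world_size()
# ... training code ...
dist.destroy_process_group()

which is launched with something like:

torchrun --nproc_per_node=2 train.py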
As we can see, the first approach specifies the rank and world size explicitly, while the second approach does not. What leads to this difference?
I also tried to use mp.spawn to start distributed training without specifying the rank and world size, and it raised RuntimeError: Address already in use. What causes this problem?
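Concretely, this is roughly what I tried (a sketch; the real script has the training code in place):

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    # Same as the first approach, but without rank= and world_size=;
    # this call is where RuntimeError: Address already in use is raised.
    dist.init_process_group("nccl")
    # ... training code ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)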