Difference between two kinds of distributed training paradigms

From the document (Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation) we can see there are two approaches for setting up distributed training.

The first approach is to use torch.multiprocessing.spawn() within a single Python file.

The second approach is to use torchrun or torch.distributed.launch.

I observed that there is a slight difference in how the process group is initialized.

For the first approach, it is

dist.init_process_group("nccl", rank=rank, world_size=2)
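To make the first approach concrete, here is a minimal, runnable sketch of the spawn-based setup. It is an illustration, not the original poster's script: the gloo backend is used so it runs on CPU-only machines (the thread uses nccl, which additionally needs CUDA tensors and one GPU per rank), and the master address/port values are arbitrary choices.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Every rank must agree on the rendezvous address and port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # With spawn, we must pass rank and world_size explicitly.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(1)
    dist.all_reduce(t)  # default op is SUM, so t becomes world_size
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # mp.spawn passes the worker index as the first argument (the rank).
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Note that mp.spawn supplies each worker's index as its first argument, which is why the rank can be forwarded to init_process_group by hand.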

For the second approach, it is

torch.distributed.init_process_group(backend='nccl',
                                     init_method='env://')
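For comparison, a hedged sketch of a script written for the second approach, meant to be started with something like `torchrun --nproc_per_node=2 train.py` (the filename is just an example). The backend is chosen based on GPU availability here so the sketch also runs on CPU; the thread itself uses nccl.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT, so env:// (the default init_method) finds everything
    # in the environment and no rank/world_size arguments are needed.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # ... training code using dist.get_rank() / dist.get_world_size() ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```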

As we can see, the first approach specifies the rank and world size while the second does not, so what leads to such a difference?

I also tried to use mp.spawn to start distributed training without specifying the rank and world size, and it yielded an error: RuntimeError: Address already in use. So what leads to this problem?

My guess is that torchrun automatically sets the appropriate rank and world_size for you. cc @Kiuk_Chung to confirm.

This is likely because your previous training run is still running and holding the old port, or because the port specified in MASTER_PORT is being used by something else.
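One way to sidestep the port collision is to ask the OS for a currently unused port instead of hardcoding MASTER_PORT. This is a small helper sketch; find_free_port is defined here for illustration and is not a PyTorch API.

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS assigns an unused port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 means "pick any free port"
        return s.getsockname()[1]
```

You could then set `os.environ["MASTER_PORT"] = str(find_free_port())` in the parent process before calling mp.spawn, and pass the value to the workers. There is a small race window between closing the probe socket and the rendezvous binding it, but in practice this avoids collisions with stale runs.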

Yes, @pritamdamania87 is correct. torch.multiprocessing.spawn() is a lower-level API. In fact, torchrun uses torch.multiprocessing under the hood to invoke multiple copies of the user's main method when the main method is passed as a function.

Since torchrun is a higher-level tool, it sets up all the context and environment variables that torch.distributed.init_process_group() expects, so that as a user you only need to pass the backend parameter. (FYI, you don't even need to pass init_method, since env:// is the default.)

In short, use torchrun when you can, and use torch.multiprocessing.spawn() when you need to.