What is the purpose of `torch.distributed.launch`?

In the script, it describes using python -m torch.distributed.launch ... to spawn the processes, but I see that the PyTorch ImageNet example does not use it and is still able to spawn processes, so what's the point of it? I also see that a lot of third-party open-source training repos call torch.distributed.launch in their bash scripts for training. I am not very clear on what it does; it seems like it just sets environment variables?

PyTorch ImageNet example without the launch file: examples/imagenet at master · pytorch/examples · GitHub

It would be very nice if there were an example of using the launch file in the PyTorch examples, to see what the difference is.

Can anyone confirm that what is shown here, Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.10.0+cu111 documentation, is actually a different way to do the same thing we can do using torch.distributed.launch?

torch.distributed.launch is a CLI tool that helps you create k copies of your training script (one per process). And, as you correctly pointed out, it sets certain env vars that DDP uses to get information about rank, world size, and so on. The closest analogy is what mpirun is to MPI applications. torch.distributed.launch complements DDP by making it easy for you to run your DDP training script.
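
To make that concrete, here is a minimal sketch of what a training script looks like when it relies on the launcher. The file name is made up, and the exact env vars assumed here (RANK, WORLD_SIZE, MASTER_ADDR/MASTER_PORT, LOCAL_RANK) are what recent PyTorch versions populate; older releases passed --local_rank as a command-line argument instead:

```python
# minimal_ddp_launch.py -- sketch only; assumes torch.distributed.launch has
# already set RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # init_method="env://" tells init_process_group to read rank, world size and
    # the rendezvous address from the environment the launcher prepared.
    dist.init_process_group(backend="gloo", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])  # older versions: parse --local_rank

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)  # add device_ids=[local_rank] when each rank owns a GPU

    # ... optimizer, DataLoader with DistributedSampler, training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You would start it with something like python -m torch.distributed.launch --nproc_per_node=2 minimal_ddp_launch.py, and the launcher takes care of starting the two copies and filling in the env vars for each.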

It is one way to launch DDP scripts, but not the only way.
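
For contrast, here is a rough sketch of the other way, which is roughly what the ImageNet example and the DDP tutorial do: the script spawns its own worker processes with torch.multiprocessing.spawn and sets the rendezvous info itself instead of relying on the launcher. Again the file name, port, and world size are just placeholders for illustration:

```python
# minimal_ddp_spawn.py -- sketch of launching DDP without torch.distributed.launch,
# along the lines of the ImageNet example / DDP tutorial.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # The script sets the rendezvous info itself instead of a launcher doing it.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)  # add device_ids=[rank] when each rank owns a GPU

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # mp.spawn starts world_size processes and calls worker(rank, world_size) in
    # each one, which is the job torch.distributed.launch would otherwise do.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

This one is run as a plain python minimal_ddp_spawn.py; the script itself plays the role of the launcher.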

How does it make it easier? Is there a PyTorch example, beyond the ImageNet one, that shows how torch.distributed.launch could be a better option than just passing the arguments directly, the way main.py does in the ImageNet example?