Launching torch.ddp with one command

I typically see a ddp script being launched by submitting multiple commands (one per process), e.g.:

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=127.0.0.1 --master_port=12345
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=3 --node_rank=1 --master_addr=10.47.164.34 --master_port=12345
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=3 --node_rank=2 --master_addr=10.47.164.34 --master_port=12345

and in the script itself:

torch.distributed.init_process_group(backend="nccl", init_method="env://")

However, I am using a cluster-management system, and the admin would prefer that I submit only one command, so it would have to be the same command on every node.

Are there any examples of using mpiexec (just to submit the command), or anything else, so that the master, slaves, etc. are created automatically?

I’m assuming you mean you’d like to use the same command on all the nodes to spawn the processes. You can use this command on all nodes, but the node rank still has to be handled somehow, since it must be different on each node:

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=3 --node_rank=<rank> --master_addr=10.47.164.34 --master_port=12345

You probably need some way of passing in the appropriate unique rank. Does your cluster-management system provide environment variables that differ per node? If so, you can pass in the node_rank via an environment variable.
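To make the environment-variable idea concrete, here is a rough sketch of a wrapper that could be the single command you submit. It assumes your scheduler exports a per-node index under one of the names below: NODE_RANK is a made-up name you would set yourself, SLURM_NODEID is set by Slurm, OMPI_COMM_WORLD_RANK by Open MPI's mpiexec, and PMI_RANK by MPICH-style launchers (the MPI ones only match the node rank if you start exactly one process per node). The script name train.py is a placeholder, and the address and port are just the ones from the question.

```python
import os
import subprocess
import sys

# Candidate variables holding a per-node index. Which one (if any) exists on
# your cluster is an assumption; NODE_RANK is a hypothetical name of our own.
RANK_ENV_VARS = ("NODE_RANK", "SLURM_NODEID", "OMPI_COMM_WORLD_RANK", "PMI_RANK")


def detect_node_rank(env=os.environ):
    """Return the first per-node index found in the environment."""
    for name in RANK_ENV_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    raise RuntimeError("none of %s is set" % ", ".join(RANK_ENV_VARS))


def build_launch_command(node_rank, script="train.py", nnodes=3,
                         master_addr="10.47.164.34", master_port=12345):
    """Build the torch.distributed.launch command line for this node."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node=1",
        "--nnodes=%d" % nnodes,
        "--node_rank=%d" % node_rank,
        "--master_addr=%s" % master_addr,
        "--master_port=%d" % master_port,
        script,  # placeholder training script
    ]


def main():
    """Detect this node's rank, run the launcher, and return its exit code."""
    return subprocess.call(build_launch_command(detect_node_rank()))
```

Calling main() from a small entry script on every node would then run the identical command everywhere, with only the environment differing between nodes.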

Another option might be TorchElastic, a fault-tolerant wrapper around DDP (https://pytorch.org/elastic/0.2.1/quickstart.html). TorchElastic figures out the rank automatically, so you can use the same command on all nodes.
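Whichever launcher you end up with, the script side can stay the same. Both torch.distributed.launch and TorchElastic hand each worker its RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT through the environment, which is what init_method="env://" reads. Below is a minimal sketch of that side; the helper names are made up for illustration, and falling back to 0 when LOCAL_RANK is missing is an assumption that only fits one process per node.

```python
import os


def read_dist_env(env=os.environ):
    """Collect the per-worker values that the launcher exports."""
    return {
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        # Assumption: default to 0 if the launcher did not set LOCAL_RANK.
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }


def init_distributed(backend="nccl"):
    """Initialise the process group from environment variables."""
    # Imported here so read_dist_env can be used without torch installed.
    import torch
    import torch.distributed as dist

    info = read_dist_env()
    if backend == "nccl":
        # Bind this process to its GPU before creating the NCCL group.
        torch.cuda.set_device(info["local_rank"])
    dist.init_process_group(backend=backend, init_method="env://")
    return info
```

Nothing in the script refers to a specific node, so it does not care whether the rank was passed in by hand or worked out by TorchElastic.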