torch.distributed.launch vs torch.multiprocessing.spawn

What are the implementation and performance differences between torch.distributed.launch and torch.multiprocessing.spawn?

torch.distributed.launch uses subprocess.Popen to start worker processes, while torch.multiprocessing.spawn uses Python's multiprocessing module. The performance differences between the two are the typical multiprocessing-vs-subprocess differences.
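To make the launcher side concrete, here is a minimal sketch (not the actual launcher code) of how a subprocess.Popen-based launcher can start one process per local rank and hand each worker its identity through environment variables; the function name `launch_workers` and the echo-only worker command are illustrative assumptions:

```python
import os
import subprocess
import sys

def launch_workers(nproc_per_node):
    """Start one OS process per local rank, like torch.distributed.launch
    does in spirit (simplified sketch, not the real implementation)."""
    procs = []
    for local_rank in range(nproc_per_node):
        # Pass rank info via environment variables, as the launcher does.
        env = dict(os.environ,
                   LOCAL_RANK=str(local_rank),
                   WORLD_SIZE=str(nproc_per_node))
        # Each worker just echoes its LOCAL_RANK here; a real launcher
        # would run the user's training script instead.
        p = subprocess.Popen(
            [sys.executable, "-c",
             "import os; print(os.environ['LOCAL_RANK'])"],
            env=env, stdout=subprocess.PIPE, text=True)
        procs.append(p)
    # Collect each worker's output in launch order.
    return [p.communicate()[0].strip() for p in procs]

print(launch_workers(2))
```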

Besides that, torch.distributed.launch also configures several environment variables and passes command-line arguments to the distributed training script, e.g., RANK, LOCAL_RANK, WORLD_SIZE, etc. torch.multiprocessing.spawn, on the other hand, is a general-purpose multiprocessing utility, not specifically tailored for torch.distributed.
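On the worker side, a script started by torch.distributed.launch typically reads those variables back out of the environment. A small sketch of that pattern (the helper name `get_dist_config` and the fallback defaults are assumptions for standalone running):

```python
import os

def get_dist_config():
    """Read the distributed-training identity that the launcher exported.
    Defaults allow the script to also run standalone (assumption)."""
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

# Simulate what the launcher would have set for one worker.
os.environ.update(RANK="3", LOCAL_RANK="1", WORLD_SIZE="8")
print(get_dist_config())
```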

If you need multi-server distributed data-parallel training, it might be more convenient to use torch.distributed.launch, as it automatically calculates global ranks for you from --nnodes, --node_rank, and --nproc_per_node. If you only need single-server multi-GPU data-parallel training, both should work the same.
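The rank calculation the launcher performs is simple arithmetic over those three flags; a sketch of it (the function names are illustrative, not the launcher's actual internals):

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a worker: processes on earlier nodes come first."""
    return node_rank * nproc_per_node + local_rank

def world_size(nnodes, nproc_per_node):
    """Total number of workers across all nodes."""
    return nnodes * nproc_per_node

# Example: 2 nodes x 4 GPUs each. The worker with local rank 2 on
# node 1 gets global rank 1 * 4 + 2 = 6, out of a world size of 8.
print(global_rank(1, 2, 4))  # 6
print(world_size(2, 4))      # 8
```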