What are the implementation and performance differences between torch.distributed.launch and torch.multiprocessing.spawn?
torch.distributed.launch uses subprocess.Popen, while torch.multiprocessing.spawn uses the multiprocessing module, so the performance differences between the two are the typical multiprocessing vs. subprocess differences.
Besides that, torch.distributed.launch also configures several environment variables and passes command-line arguments to the distributed training script, e.g., RANK, LOCAL_RANK, WORLD_SIZE, etc. On the other hand, torch.multiprocessing.spawn is general multiprocessing, not specifically tailored for torch.distributed.
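As a rough illustration, here is a minimal sketch of what the two entry points look like from the training script's point of view. The file names, backend choice, and world size are assumptions for the example, and depending on the PyTorch version, torch.distributed.launch may pass the local rank as a --local_rank command-line argument instead of the LOCAL_RANK environment variable (its --use_env flag controls this).

```python
# train_launch.py -- minimal sketch for a script started by torch.distributed.launch.
# The launcher forks workers with subprocess.Popen and exports RANK, LOCAL_RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each of them.
import os
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])  # may instead arrive as --local_rank
    world_size = int(os.environ["WORLD_SIZE"])

    # env:// rendezvous picks up MASTER_ADDR / MASTER_PORT set by the launcher.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"worker {rank}/{world_size} (local_rank={local_rank}) is up")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

```python
# train_spawn.py -- minimal sketch using torch.multiprocessing.spawn, where the
# distributed bookkeeping (rank, world size, rendezvous address) is done by hand.
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # spawn only passes the process index; nothing distributed-specific is set up.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # assumed single-machine setup
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    print(f"worker {rank}/{world_size} is up")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumption: two local workers
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```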
If you need multi-server distributed data parallel training, it might be more convenient to use torch.distributed.launch, as it automatically calculates ranks for you through --nnodes, --node_rank, and --nproc_per_node (see the usage sketch below). If you need single-server multi-GPU data parallel training, both should work the same.