In addition to @ptrblck’s answer: there is really no magic inside `python -m torch.distributed.launch`. It can be considered mostly a convenience launcher script that helps you run distributed jobs while avoiding most of the boilerplate code. Ultimately it sets up the process group(s), environment variables, etc. that you can consume in your training script passed in <ARGS>. Technically there is nothing preventing you from replicating all that code in a custom train.py script (i.e. option 1), but in most use cases using `python -m torch.distributed.launch` is much easier and more future-proof.
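
For illustration, here is a minimal sketch of the kind of train.py that consumes what the launcher sets up. The backend choice (nccl) and the exact argument handling are assumptions for the example, not something prescribed above; the launcher passes `--local_rank` to each process and exports the rendezvous environment variables that `init_process_group` reads.

```python
import argparse

import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch passes --local_rank to every process it spawns
    # (with --use_env it exports LOCAL_RANK as an environment variable instead).
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # The launcher also exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # which init_process_group reads via the "env://" init method.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print(f"rank {rank}/{world_size} running on GPU {args.local_rank}")

    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched e.g. with `python -m torch.distributed.launch --nproc_per_node=2 train.py`, each spawned process receives its own `--local_rank`, so the script above runs once per GPU.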