Difference between torch.distributed and python run

base-y · April 30, 2022, 7:44pm

Hi, can I know what's the difference between the following:

1. python train.py .....<ARGS>
2. python -m torch.distributed.launch <ARGS>
3. deepspeed train.py <ARGS>

Even option 1 seems to be using some sort of distributed training when there are multiple GPUs. My understanding was that option 1 would only use 1 GPU unless it was run through distributed.launch.

Also, does deepspeed use torch.distributed in the backend or is it something different?

thanks in advance

ptrblck · May 2, 2022, 7:12am

It depends on what train.py uses, since the script itself could still spawn multiple processes.
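
One way to see what a given run is actually doing is to print, from inside train.py, the process ID and the environment variables that launchers such as torch.distributed.launch export to each worker. The snippet below is only a minimal sketch: `launch_info` and `LAUNCHER_VARS` are hypothetical names, and the check is a heuristic, since a script can also spawn its own workers or use several GPUs from a single process without these variables being set.

```python
import os

# Variables that launchers such as torch.distributed.launch typically export
# to each worker process.
LAUNCHER_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def launch_info(environ=None):
    """Hypothetical helper: summarize how the current process was started."""
    env = os.environ if environ is None else environ
    found = {name: env[name] for name in LAUNCHER_VARS if name in env}
    return {
        "pid": os.getpid(),
        "launcher_vars": found,
        # Heuristic only: RANK and WORLD_SIZE together suggest an external launcher.
        "looks_launched": "RANK" in found and "WORLD_SIZE" in found,
    }


if __name__ == "__main__":
    print(launch_info())
```

If this prints once with no launcher variables, the run was started as a plain single process; several lines with different PIDs mean multiple processes exist, whether they came from a launcher or from inside the script.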

cbalioglu · May 2, 2022, 6:23pm

In addition to @ptrblck’s answer: there is really no magic inside python -m torch.distributed.launch. It can be considered mostly a convenience launcher script that helps you run distributed jobs while avoiding most of the boilerplate code. Ultimately it spawns the worker processes and sets the environment variables, etc. that you can consume in your training script passed in <ARGS>. Technically there is nothing preventing you from replicating all that code in a custom train.py script (i.e. option 1), but in most use cases using python -m torch.distributed.launch is much easier and more future-proof.
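
To make the "no magic" point concrete, here is a toy launcher written with the standard library only. It is a sketch of the idea, not the actual torch.distributed.launch implementation: the `launch` function, the `WORKER_CODE` stand-in for a training script, and the default address and port are all assumptions made for illustration, and a single node is assumed.

```python
import os
import subprocess
import sys

# Stand-in for a training script: it only reports the values it was given.
WORKER_CODE = 'import os; print("rank", os.environ["RANK"], "of", os.environ["WORLD_SIZE"])'


def launch(nproc, master_addr="127.0.0.1", master_port="29500"):
    """Toy launcher (hypothetical): start nproc workers with per-rank environment."""
    procs = []
    for rank in range(nproc):
        env = dict(os.environ)
        env.update({
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),  # single node assumed, so local rank == global rank
            "WORLD_SIZE": str(nproc),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": master_port,
        })
        procs.append(subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE],
            env=env, stdout=subprocess.PIPE, text=True,
        ))
    # Collect each worker's output in rank order.
    outputs = []
    for proc in procs:
        out, _ = proc.communicate()
        outputs.append(out.strip())
    return outputs


if __name__ == "__main__":
    for line in launch(2):
        print(line)
```

In a real training script, each worker would typically pass these variables on by calling torch.distributed.init_process_group with init_method="env://", which reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.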

base-y · May 6, 2022, 1:10pm

I am using the Hugging Face Trainer inside train.py. It could be that the Trainer is spawning multiple processes. Maybe that's why option 1 is also using distributed training.