Anyone have advice on how to adapt the `ddp_pipeline` tutorial for multi-node training? I've been using `torchrun` for multi-node DistributedDataParallel training; however, the `ddp_pipeline` example relies on `mp.spawn`.
I tried to modify the `ddp_pipeline` script to run with `torchrun`, but that caused the `DistributedDataParallel(model)` call to hang. A sketch of the kind of change I made is below.
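
For context, here's a minimal sketch of my modification (assuming `torchrun`'s default `env://` rendezvous and the `RANK`/`WORLD_SIZE`/`LOCAL_RANK` variables it sets; `build_model()` is a stand-in for the tutorial's pipelined model construction):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun launches the worker processes itself and sets RANK,
    # WORLD_SIZE, and LOCAL_RANK, so mp.spawn is no longer needed and
    # init_process_group can read everything from the environment.
    dist.init_process_group(backend="nccl")  # env:// rendezvous by default
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model()  # stand-in for the tutorial's pipelined model
    # device_ids is omitted because the pipelined model spans multiple
    # GPUs; this is the call that hangs for me.
    ddp_model = DDP(model)


if __name__ == "__main__":
    main()
```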
Would love any suggestions for either: (1) how to adapt `ddp_pipeline` for use with `torchrun`, or (2) how to set up multi-node training using `mp.spawn`.
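
For (2), my understanding is that each node would run the same script, with `mp.spawn` launching the local workers and global ranks computed from a per-node rank. Roughly something like this? (`NODE_RANK`, `NPROC_PER_NODE`, and `NNODES` are assumptions about the launch setup, with `MASTER_ADDR`/`MASTER_PORT` pointing at node 0 on every node.)

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp

NPROC_PER_NODE = 2  # assumed processes per node
NNODES = 2          # assumed number of nodes


def worker(local_rank, node_rank):
    # Compute a globally unique rank from the node rank and local rank.
    global_rank = node_rank * NPROC_PER_NODE + local_rank
    world_size = NNODES * NPROC_PER_NODE
    # env:// reads MASTER_ADDR/MASTER_PORT; rank/world_size are passed here.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=global_rank,
        world_size=world_size,
    )
    ...  # per-process model setup as in the tutorial


if __name__ == "__main__":
    # Run this script once per node, e.g. with NODE_RANK=0 on the first node.
    node_rank = int(os.environ["NODE_RANK"])
    mp.spawn(worker, args=(node_rank,), nprocs=NPROC_PER_NODE)
```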