Anyone have advice on how to adapt the ddp_pipeline tutorial for multi-node training?
I’ve been using torchrun for multi-node DistributedDataParallel training; however, the ddp_pipeline example relies on mp.spawn.
I tried to modify the ddp_pipeline script to run with torchrun, but the DistributedDataParallel(model) call then hangs.
Would love any suggestions for how to either (1) adapt ddp_pipeline for use with torchrun or (2) set up multi-node training using mp.spawn.
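
For context, the two launch styles differ mainly in where each process gets its rank: torchrun exports RANK, LOCAL_RANK and WORLD_SIZE as environment variables, while with mp.spawn the script has to compute them itself. A minimal sketch of that bookkeeping follows; the helper names and the node_rank / nproc_per_node / nnodes parameters are hypothetical, not part of the ddp_pipeline tutorial, and the code deliberately leaves out the pipeline-specific setup.

```python
import os


def ranks_from_torchrun_env(env=None):
    """Read the rank variables that torchrun exports to every worker process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env["RANK"]),              # global rank across all nodes
        "local_rank": int(env["LOCAL_RANK"]),  # rank within this node
        "world_size": int(env["WORLD_SIZE"]),  # total number of processes
    }


def ranks_for_spawn(local_rank, node_rank, nproc_per_node, nnodes):
    """Compute the same values by hand for a worker started with mp.spawn.

    Assumes every node runs the same number of processes (hypothetical setup).
    """
    return {
        "rank": node_rank * nproc_per_node + local_rank,
        "local_rank": local_rank,
        "world_size": nnodes * nproc_per_node,
    }
```

In the mp.spawn case, the resulting rank and world_size would typically be passed to dist.init_process_group, with MASTER_ADDR and MASTER_PORT pointing at an address every node can reach. Whether a mismatch here is what causes the hang described above is not something this sketch can confirm.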