How to Adapt DDP Pipeline Tutorial for Multi-Node Training

jasonkrone · March 10, 2024, 4:16am

Anyone have advice on how to adapt the ddp_pipeline tutorial for multi-node training?

I’ve been using torchrun for multi-node DistributedDataParallel training; however, the ddp_pipeline example relies on mp.spawn.

I tried to modify the ddp_pipeline script to run with torchrun, but it caused the DistributedDataParallel(model) call to hang.

Would love any suggestions for how to either (1) adapt ddp_pipeline for use with torchrun or (2) set up multi-node training using mp.spawn.
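For anyone comparing the two options, here is a minimal sketch of how the rank bookkeeping differs between them. It is not taken from the tutorial and is not a confirmed fix for the hang. torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to each worker, whereas mp.spawn only passes the per-node process index, so the global rank has to be computed by hand. The names DistInfo, from_torchrun_env, and from_spawn are hypothetical helpers made up for illustration.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistInfo(NamedTuple):
    """Rank bookkeeping for one worker process (hypothetical helper type)."""
    rank: int        # global rank across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes across all nodes


def from_torchrun_env(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read the variables that torchrun sets for every worker it launches."""
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def from_spawn(local_rank: int, node_rank: int,
               nproc_per_node: int, nnodes: int) -> DistInfo:
    """Derive the same values under mp.spawn.

    mp.spawn passes only the process index on the current node (local_rank).
    node_rank, nproc_per_node, and nnodes are assumed to be supplied by you,
    for example as command-line arguments on each node.
    """
    return DistInfo(
        rank=node_rank * nproc_per_node + local_rank,
        local_rank=local_rank,
        world_size=nnodes * nproc_per_node,
    )


# Either result can then be handed to the process group setup, e.g.
#   torch.distributed.init_process_group(
#       backend="nccl", rank=info.rank, world_size=info.world_size)
# With mp.spawn, MASTER_ADDR and MASTER_PORT must also be set by hand and
# must be identical on every node; torchrun sets them for you.
```

As a usage example, the second process on the second of two nodes with two processes each would call from_spawn(local_rank=1, node_rank=1, nproc_per_node=2, nnodes=2), giving global rank 3 and world size 4. One thing worth checking when a collective setup call appears to hang is whether every process agrees on world_size and has a unique rank, since a mismatch leaves the others waiting.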

jasonkrone · March 27, 2024, 12:29am

Solved this. See my answer here!