Hi, I’m facing a problem. When using DistributedDataParallel with the NCCL backend, my training runs into a deadlock. Following the PyTorch docs, I tried setting set_start_method to spawn and to forkserver, but then an "address already in use" error occurs.
I faced a similar error and solved it by initializing the process group first, and only then setting the model's CUDA device. Doing it the other way around led to the same kind of deadlock you describe.