Hi, I’m facing a problem. When using DistributedDataParallel with the NCCL backend, my training runs into a deadlock. Following the PyTorch docs, I tried setting set_start_method to spawn and to forkserver, but then an "address already in use" error occurs.
I faced a similar error and solved it by initializing the process group first, and only then setting the model's CUDA device. Doing it the other way around led to the same kind of deadlock you describe.