Hi, I’m trying to use DataParallel on a build from master (0.4.0a0+361baa5), but the replication of the module, which calls comm.broadcast_coalesced, never returns. The program just hangs there.
I wonder if it has anything to do with https://github.com/pytorch/pytorch/pull/4999. Any other ideas?
Maybe @fmassa has some ideas?
EDIT: I think there is a problem with NCCL; I’m now trying with NO_SYSTEM_NCCL.
EDIT2: Recompiling PyTorch with the bundled NCCL instead of the system NCCL solved the issue.
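For anyone hitting the same hang, a sketch of the rebuild I mean (assuming a source checkout of pytorch at this era; NO_SYSTEM_NCCL is the build-time environment variable that tells setup.py to ignore any system-installed NCCL and compile the bundled copy instead):

```shell
# Start from a clean build so stale objects linked against the
# system NCCL are not reused.
cd pytorch
python setup.py clean

# NO_SYSTEM_NCCL=1 forces the build to use the NCCL source that
# ships in third_party/ rather than the system libnccl.
NO_SYSTEM_NCCL=1 python setup.py install
```

Afterwards you can sanity-check which NCCL got linked with `torch.cuda.nccl.version()` (available in builds of this vintage) before re-running the DataParallel code.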