Comm.broadcast_coalesced not returning

Hi, I’m trying to use DataParallel on a build from master (0.4.0a0+361baa5), but the replication of the module, which calls comm.broadcast_coalesced, never returns. The program just hangs there.
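For reference, a minimal sketch of the kind of setup that hangs for me (the model and sizes here are placeholders, not my actual code; it assumes at least two visible GPUs):

```python
import torch
import torch.nn as nn

# Placeholder module wrapped in DataParallel; any module reproduces it for me.
model = nn.DataParallel(nn.Linear(128, 10).cuda(), device_ids=[0, 1])

x = torch.randn(32, 128).cuda()
out = model(x)  # hangs inside replicate(), which calls comm.broadcast_coalesced
print(out.shape)
```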

I wonder if it has anything to do with https://github.com/pytorch/pytorch/pull/4999. Any other ideas?

Maybe @fmassa has some ideas? :slight_smile:

EDIT: I think there is a problem with NCCL; I’m now trying a rebuild with NO_SYSTEM_NCCL=1
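To check that the hang is really below DataParallel, I call broadcast_coalesced directly; this is just my isolation test, assuming GPUs 0 and 1 are available:

```python
import torch
from torch.cuda import comm

# If this also hangs, the problem is in the NCCL/broadcast layer itself
# rather than in DataParallel's replicate logic.
tensors = [torch.randn(1024, 1024, device='cuda:0') for _ in range(4)]
copies = comm.broadcast_coalesced(tensors, devices=[0, 1])
print(len(copies), len(copies[0]))  # one tuple of broadcast tensors per device
```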

EDIT2: Recompiling PyTorch with the included NCCL instead of the system NCCL solved the issue
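After the rebuild, a quick sanity check I ran to confirm which NCCL the new build picked up (note the return format of version() differs across PyTorch releases: an integer on older builds, a (major, minor, patch) tuple on newer ones):

```python
import torch

# Report the NCCL version PyTorch was built/linked against, plus basic GPU info.
print(torch.cuda.nccl.version())
print(torch.cuda.is_available(), torch.cuda.device_count())
```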

Hi Ignacio-Rocco, I just ran into exactly the same problem. In my case it seems to happen only occasionally; sometimes there is no such problem. I wonder if you could explain a little more about “recompiling PyTorch with the ‘included’ NCCL”. What do you mean by ‘included NCCL’?

I am using torch 1.4.0.

Thanks a lot.

EDIT: The cause turned out to be that os.environ['CUDA_LAUNCH_BLOCKING'] = "1" was set somewhere in my code.
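In case it helps anyone else, this is roughly how I checked for and cleared that variable (clearing it in Python only helps if done before CUDA is initialized; otherwise restart the process without it):

```python
import os

# "1" means launch blocking is still enabled for this process.
print(os.environ.get('CUDA_LAUNCH_BLOCKING'))

# Remove the variable for the current process if it was set earlier.
os.environ.pop('CUDA_LAUNCH_BLOCKING', None)
```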