Program hangs after NCCL process group is destroyed and restarted

Hi all!

I train a model using DistributedDataParallel with the NCCL backend. At a specific point during training, I want to reinitialize the process group, so I do the following (a code sketch follows the list):

  • dist.destroy_process_group()
  • del self.model (the DDP model)
  • del self.optimizer
  • self.store = dist.TCPStore(host_name=master_ip, port=int(master_port), is_master=False) (the master is properly setup)
  • dist.init_process_group("nccl", world_size=world_size, rank=rank, store=self.store, group_name=gname, timeout=timedelta(seconds=10))
  • self.model = … (whatever model i have here)
  • self.model = DDP(model, device_ids=[self.local_rank])
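
Roughly, inside my trainer this looks like the sketch below. build_model() is just a placeholder for whatever model I construct, and master_ip, master_port, world_size, rank, gname, and self.local_rank are set up elsewhere:

    from datetime import timedelta

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Tear down the old group and drop the old wrappers.
    dist.destroy_process_group()
    del self.model
    del self.optimizer

    # Reconnect to the rendezvous store (the master-side store is already running).
    self.store = dist.TCPStore(host_name=master_ip, port=int(master_port), is_master=False)
    dist.init_process_group(
        "nccl",
        world_size=world_size,
        rank=rank,
        store=self.store,
        group_name=gname,
        timeout=timedelta(seconds=10),
    )

    # Rebuild the model and wrap it in DDP again.
    model = build_model().to(self.local_rank)  # build_model() is a placeholder
    self.model = DDP(model, device_ids=[self.local_rank])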

Although the process group appears to be initialized (the store barrier is passed), the program always hangs afterwards, though not at a consistent point.

Could this be an issue with non-aborted NCCL communicators? Does anyone know how to solve it?

Thank you

Could you post a script that reproduces this issue? Is it happening with multiple devices on a single node, on multiple nodes, or both?

Additionally, have you checked the output when running with, e.g., NCCL_DEBUG=INFO to see if that uncovers a specific failure? See Environment Variables — NCCL 2.12.12 documentation.
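
If it is more convenient than exporting the variable in the shell, it can also be set from Python, as long as that happens before the first NCCL communicator is created, e.g.:

    import os

    # Must be set before the first collective / communicator creation.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Optionally narrow the output to init-related messages:
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT"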

Hi,

Thank you for your message!
The problem was due to some NCCL communicators that were not aborted. This caused subsequent calls to cudaStreamSynchronize to block. I solved it by explicitly removing all the “old” communicators.
