Program hangs after NCCL process group is destroyed and restarted

Hi all!

I train a model using DistributedDataParallel with the NCCL backend. At a specific point during training, I want to reinitialize the process group, so I do the following (a code sketch follows the list):

  • dist.destroy_process_group()
  • del self.model (the DDP model)
  • del self.optimizer
  • self.store = dist.TCPStore(host_name=master_ip, port=int(master_port), is_master=False) (the master is properly setup)
  • dist.init_process_group("nccl", world_size=world_size, rank=rank, store=self.store, group_name=gname, timeout=timedelta(seconds=10))
  • self.model = … (whatever model i have here)
  • self.model = DDP(model, device_ids=[self.local_rank])
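
Roughly, inside my trainer this looks like the sketch below. build_model() is just a placeholder for whatever model I construct, and master_ip, master_port, world_size, rank, gname, and self.local_rank are set up elsewhere:

    from datetime import timedelta

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Tear down the old group and drop the old wrappers.
    dist.destroy_process_group()
    del self.model
    del self.optimizer

    # Reconnect to the rendezvous store (the master-side store is already running).
    self.store = dist.TCPStore(host_name=master_ip, port=int(master_port), is_master=False)
    dist.init_process_group(
        "nccl",
        world_size=world_size,
        rank=rank,
        store=self.store,
        group_name=gname,
        timeout=timedelta(seconds=10),
    )

    # Rebuild the model and wrap it in DDP again.
    model = build_model().to(self.local_rank)  # build_model() is a placeholder
    self.model = DDP(model, device_ids=[self.local_rank])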

Although the process group appears to be initialized (the store barrier is passed), the program always hangs afterwards, though not at a consistent point.

Could this be an issue with non-aborted NCCL communicators? Does anyone know how to solve it?

Thank you

Could you post a script that reproduces this issue? Is it happening with multiple devices on a single node, on multiple nodes, or both?

Additionally, have you checked the output when running with, e.g., NCCL_DEBUG=INFO to see if that uncovers a specific failure? See Environment Variables — NCCL 2.12.12 documentation.
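
If it is more convenient than exporting the variable in the shell, it can also be set from Python, as long as that happens before the first NCCL communicator is created, e.g.:

    import os

    # Must be set before the first collective / communicator creation.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Optionally narrow the output to init-related messages:
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT"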

Hi,

Thank you for your message!
The problem was due to some NCCL communicators that were not aborted. This caused subsequent calls to cudaStreamSynchronize to block. I solved it by explicitly removing all the “old” communicators.
