I train a model using DistributedDataParallel with the NCCL backend. At a specific point during training, I want to reinitialize the process group, so I do the following (a condensed sketch follows the list):
- `del self.model` (the DDP-wrapped model)
- `del self.optimizer`
- `self.store = dist.TCPStore(host_name=master_ip, port=int(master_port), is_master=False)` (the master store is properly set up)
- `dist.init_process_group("nccl", world_size=world_size, rank=rank, store=self.store, group_name=gname, timeout=timedelta(seconds=10))`
- `self.model = ...` (whatever model I have here)
- `self.model = DDP(model, device_ids=[self.local_rank])`
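For reference, here is the whole sequence condensed into one method sketch on my trainer class (`build_model` is a placeholder for my model construction; the `destroy_process_group()` guard is shown explicitly, since `init_process_group` raises if the default group is still registered):

```python
from datetime import timedelta

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def reinit_process_group(self, master_ip, master_port, world_size, rank, gname):
    # Drop references to the old DDP wrapper and its optimizer.
    del self.model
    del self.optimizer

    # init_process_group raises if the default group is still alive,
    # so tear it down before re-initializing.
    if dist.is_initialized():
        dist.destroy_process_group()

    # Reconnect to the TCP store as a client; the master store
    # (is_master=True) is set up elsewhere.
    self.store = dist.TCPStore(
        host_name=master_ip,
        port=int(master_port),
        is_master=False,
    )

    # Re-initialize the NCCL process group against the shared store.
    dist.init_process_group(
        "nccl",
        world_size=world_size,
        rank=rank,
        store=self.store,
        group_name=gname,
        timeout=timedelta(seconds=10),
    )

    # Rebuild the model and re-wrap it in DDP on the local GPU.
    model = self.build_model().cuda(self.local_rank)  # placeholder
    self.model = DDP(model, device_ids=[self.local_rank])
```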
Although the process group appears to be initialized (the store barrier is passed), the program always hangs, though not at a consistent point.
Could this be an issue with non-aborted NCCL communicators? Does anyone know how to solve it?