When should I call `dist.destroy_process_group()`?

Or should I never call it at all if I'm not going to re-initialize another process group afterwards?

Context: I ran into a peculiar bug when doing multi-node GPU training: "early exits of remote peer". It can be reproduced with the following code (at least on my distributed GPU cluster):

import time
import torch.distributed as dist

dist.init_process_group('nccl', rank=rank, world_size=world_size)

# ... do the actual training work here ...

if rank == 0:
    time.sleep(30)  # the key trick: let rank 0 lag behind so it exits later than the others

dist.destroy_process_group()

With this, I noticed that sometimes (non-deterministically, but with fairly high probability, ~3 out of 10 runs) rank 0 complains about "early exits of remote peers" when the program finishes. This happens only when I run on two or more machines (nodes); single-node multi-process runs are 100% fine, I have never hit the issue there. My impression is that dist.destroy_process_group() doesn't do any sync across nodes and just tears the process down (that's how it "feels" based on the crashes and error messages, not necessarily how it actually works), or at least that whatever sync it does isn't reliable, so the issue shows up intermittently. Now, if I do a sync before the destroy:

dist.barrier()
dist.destroy_process_group()

The issue is alleviated, but it still happens (roughly 1 out of 50 runs).

What's more interesting: if I just remove dist.destroy_process_group() entirely, everything seems fine (or at least a lot better). All ranks then end up waiting on the sleep from rank 0 and exit together. But even in this case the issue still happens occasionally. My current workaround is to put a dist.barrier() at the very end of the program instead (see the sketch below), which looks nasty, let alone that I'm 0% sure it's the right thing to do.
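
For concreteness, this is roughly what my current workaround looks like (a simplified sketch; run_training is a placeholder for my real training loop, and rank/world_size come from my launcher):

import time
import torch.distributed as dist

def main(rank, world_size):
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    run_training(rank)  # placeholder for the actual work

    if rank == 0:
        time.sleep(30)  # rank 0 deliberately finishes later than the other ranks

    # workaround: make every rank wait here before the process exits,
    # with no explicit destroy_process_group() at all
    dist.barrier()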

Questions:

  1. Does the above indicate that I have some misunderstanding of the dist init/destroy API? Or is something off with my distributed cluster environment (e.g. torch not installed properly, or a network issue causing the "early exits of remote peers")?
  2. Does dist.destroy_process_group() do any sync? I would assume it should, ideally? (So that any barrier() call before it is redundant?)
  3. What's the "right" way to init/destroy the process group, especially in a multi-node training environment? I've sketched below what I'm currently experimenting with, but I'm not sure it's correct.
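
For reference, this is the init/teardown skeleton I'm experimenting with now (a sketch only, assuming a torchrun-style launcher that sets RANK, WORLD_SIZE and LOCAL_RANK, with run_training again a placeholder; I don't know whether the explicit CUDA synchronize and barrier before destroy are actually required):

import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    try:
        run_training(rank)  # placeholder for the actual work
    finally:
        # wait for all queued GPU/NCCL work on this rank to finish
        torch.cuda.synchronize()
        # sync all ranks so nobody exits while a peer is still communicating
        dist.barrier()
        dist.destroy_process_group()

if __name__ == "__main__":
    main()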