When should I call `dist.destroy_process_group()`?

Or should I never call it at all if I'm not going to re-initialize another process group afterwards?

Context: I ran into a peculiar bug when doing multi-node GPU training: "early exits of remote peer". It can be reproduced with the following code (at least on my distributed GPU cluster):

import time
import torch.distributed as dist

dist.init_process_group('nccl', rank=rank, world_size=world_size)

# ... do the actual training work here ...

if rank == 0:
    time.sleep(30)  # the key trick: let rank 0 lag behind so it exits later than the others

dist.destroy_process_group()

With this, I noticed that sometimes (non-deterministically, but with fairly high probability, ~3 out of 10 runs) rank 0 complains about "early exits of remote peers" when the program finishes. This happens only when I run on two or more machines (nodes); single-node multi-process runs are 100% fine, I have never hit the issue there. My impression is that dist.destroy_process_group() doesn't do any sync across nodes and just tears the process down (that's how it "feels" based on the crashes and error messages, not necessarily how it actually works), or at least that whatever sync it does isn't reliable, so the issue shows up intermittently. Now, if I do a sync before the destroy:

dist.barrier()
dist.destroy_process_group()

The issue is alleviated, but it still happens (roughly 1 out of 50 runs).

What's more interesting: if I just remove dist.destroy_process_group() entirely, everything seems fine (or at least a lot better). All ranks then end up waiting on the sleep from rank 0 and exit together. But even in this case the issue still happens occasionally. My current workaround is to put a dist.barrier() at the very end of the program instead (see the sketch below), which looks nasty, let alone that I'm 0% sure it's the right thing to do.
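
For concreteness, this is roughly what my current workaround looks like (a simplified sketch; run_training is a placeholder for my real training loop, and rank/world_size come from my launcher):

import time
import torch.distributed as dist

def main(rank, world_size):
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    run_training(rank)  # placeholder for the actual work

    if rank == 0:
        time.sleep(30)  # rank 0 deliberately finishes later than the other ranks

    # workaround: make every rank wait here before the process exits,
    # with no explicit destroy_process_group() at all
    dist.barrier()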

Questions:

  1. Does the above indicate that I have some misunderstanding of the dist init/destroy API? Or is something off with my distributed cluster environment (e.g. torch not installed properly, or a network issue causing the "early exits of remote peers")?
  2. Does dist.destroy_process_group() do any sync? I would assume it should, ideally? (So that any barrier() call before it is redundant?)
  3. What's the "right" way to init/destroy the process group, especially in a multi-node training environment? I've sketched below what I'm currently experimenting with, but I'm not sure it's correct.
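
For reference, this is the init/teardown skeleton I'm experimenting with now (a sketch only, assuming a torchrun-style launcher that sets RANK, WORLD_SIZE and LOCAL_RANK, with run_training again a placeholder; I don't know whether the explicit CUDA synchronize and barrier before destroy are actually required):

import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    try:
        run_training(rank)  # placeholder for the actual work
    finally:
        # wait for all queued GPU/NCCL work on this rank to finish
        torch.cuda.synchronize()
        # sync all ranks so nobody exits while a peer is still communicating
        dist.barrier()
        dist.destroy_process_group()

if __name__ == "__main__":
    main()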