It’s possible that NCCL takes longer to initialize in your setup. Could you try bumping the timeout passed to init_process_group and see whether that resolves the issue?
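For example, a minimal sketch of bumping the timeout, assuming the NCCL backend and an env:// rendezvous (the one-hour value is arbitrary):

from datetime import timedelta

import torch.distributed as dist

# Raise the timeout from the 30-minute default so slow NCCL initialization
# on some ranks does not trip the limit.
dist.init_process_group(
    backend="nccl",
    init_method="env://",  # assumes MASTER_ADDR / MASTER_PORT are set
    timeout=timedelta(hours=1),
)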
In addition, it might be useful to add a dist.barrier() call after init_process_group: it synchronizes all ranks once they have completed initialization successfully, which can help with debugging. In this particular case, it is also possible that machine “A” exits the script early, which tears down the store hosted on rank 0 and causes rank 1’s error.
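A minimal sketch of that setup, assuming the script is launched with torchrun so LOCAL_RANK and the rendezvous variables are set (run_training is a placeholder for the real per-rank work):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)           # NCCL needs a device per rank

dist.init_process_group(backend="nccl")
dist.barrier()  # every rank reaches here only if all ranks initialized

run_training()  # placeholder for the actual per-rank work

dist.barrier()  # keep rank 0 (which hosts the store) alive until all ranks finish
dist.destroy_process_group()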
I read the source code of init_process_group. It already performs a barrier at the end automatically:
def init_process_group(
    backend,
    init_method=None,
    timeout=default_pg_timeout,
    world_size=-1,
    rank=-1,
    store=None,
    group_name="",
    pg_options=None,
):
    # ......
    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    if backend == Backend.MPI:
        # MPI backend doesn't use store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(rank, store, timeout)
    # Set sequence numbers for gloo and nccl process groups.
    if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]:
        default_pg._set_sequence_number_for_group()
I think adding another dist.barrier() call after init_process_group is not needed.
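For intuition, here is an illustrative sketch of how a store-based barrier can work; this is not PyTorch's actual _store_based_barrier, and the key name and polling interval are made up. Each rank increments a shared counter in the rendezvous store and polls until every rank has arrived, so no collective communication (and hence no NCCL device state) is involved:

import time
from datetime import timedelta

import torch.distributed as dist

def store_barrier(store: dist.Store, world_size: int,
                  timeout: timedelta = timedelta(minutes=5)) -> None:
    key = "example_store_barrier"  # hypothetical key name
    store.add(key, 1)              # atomically register this rank's arrival
    deadline = time.monotonic() + timeout.total_seconds()
    while store.add(key, 0) < world_size:  # add(key, 0) reads the counter
        if time.monotonic() > deadline:
            raise RuntimeError("store-based barrier timed out")
        time.sleep(0.01)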