I had the same issue when trying to process multiple models in parallel for model combination.
I fixed the problem by NEVER calling destroy_process_group(). Instead, after calling init_process_group(…) once, I only check whether the process group has already been initialised via torch.distributed.is_initialized().
I observed that once you destroy a process group, you cannot initialise a new one anymore, because you run into this timeout.
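Here is a minimal sketch of the workaround. The helper name, backend choice, and the address/port defaults are my own assumptions, not anything from a specific codebase:

```python
import os

import torch.distributed as dist


def ensure_process_group(rank: int, world_size: int) -> None:
    """Initialise the default process group once; later calls are no-ops.

    Assumption: single-machine setup, so MASTER_ADDR/MASTER_PORT default
    to localhost and an arbitrary free port.
    """
    if dist.is_initialized():
        return  # already set up; never tear it down and re-create it
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="gloo",  # assumption: switch to "nccl" for GPU training
        rank=rank,
        world_size=world_size,
    )


# Call ensure_process_group(...) before each model's collective work
# instead of pairing every init_process_group() with destroy_process_group().
```

Calling this before each round of collective operations is safe because the second and later calls return immediately, whereas an init/destroy cycle per model is what triggered the timeout for me.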