PyTorch Distributed Gloo Backend

Is it a good practice to create a new group on every training iteration?

Using dist.new_group requires every process to call the function, even processes that are not part of the new group. Sometimes it hangs on this call for a reason I have not been able to identify. Has anybody come up with a better solution for this?
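For reference, dist.new_group is a collective call: every rank in the default group must invoke it, in the same order, even ranks that will not belong to the new group — a rank that skips the call leaves the others blocked, which is the usual cause of hangs. A minimal runnable sketch (assuming a single-process gloo setup purely for illustration; the port is arbitrary):

```python
import torch
import torch.distributed as dist

# Single-process "cluster" so the example is self-contained.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29510",
    rank=0,
    world_size=1,
)

# Every rank must execute this line, even ranks not listed in ranks=[0].
group = dist.new_group(ranks=[0])

t = torch.ones(2)
dist.all_reduce(t, group=group)  # sums over the group's single member
print(t.tolist())  # [1.0, 1.0]

dist.destroy_process_group()
```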


IIUC, it is not good practice to create a new group on every training iteration: group initialization is non-trivial and incurs communication costs.

Why do you need to create a new group every iteration?

Thanks for the answer. I need to run allreduce on a subset of nodes in each iteration. Do you have any recommendations on how to do this differently?

Could you create these subgroups before the training loop, and have each iteration use the corresponding pre-created subgroup, instead of creating the same subgroups repeatedly inside the loop?
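The pre-creation pattern could look like the sketch below: build every needed subgroup once, keyed by its rank tuple, then just look it up inside the loop. (Assumes a single-process gloo setup so the example runs standalone; with world_size=1 the only subset is (0,), and the port is arbitrary.)

```python
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29511",
    rank=0,
    world_size=1,
)

# Hypothetical partition of ranks; in a real job this would enumerate the
# rank subsets you actually plan to reduce over.
subsets = [(0,)]

# Created exactly once, before the loop. All ranks must build every group,
# in the same order.
groups = {s: dist.new_group(ranks=list(s)) for s in subsets}

for step in range(3):
    active = subsets[step % len(subsets)]
    t = torch.full((2,), float(step))
    dist.all_reduce(t, group=groups[active])  # reuse, never re-create
    print(step, t.tolist())

dist.destroy_process_group()
```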

Also, could you add a timeout in case it hangs?

Lastly, it may be worth figuring out why it hangs, starting by debugging with a small number of training iterations.

Thanks for the suggestion! I originally considered this approach, but because I want to run this on a large number of machines (around 100), the number of possible groups becomes enormous (choosing 10 out of 100 gives C(100, 10) = 17,310,309,456,440 combinations!).
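That count can be verified with Python's math.comb:

```python
import math

# Number of ways to choose 10 machines out of 100.
n_groups = math.comb(100, 10)
print(n_groups)  # 17310309456440
```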

I debugged it, and it seems the number of processes/threads exceeds the pids limit of my user's cgroup. I guess this may be related to dist.new_group and the repeated creation of groups.

The error from the kernel messages is shown below:

cgroup: fork rejected by pids controller in /user.slice/user-1000.slice/session-3.scope

I wish there were a method to delete the groups to avoid this problem.

You can call dist.destroy_process_group(group) to destroy a specific subgroup once you are done with it (calling it with no argument tears down the default group).
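A sketch of the create/use/destroy cycle, again assuming a single-process gloo setup with an arbitrary port just so it runs standalone:

```python
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29513",
    rank=0,
    world_size=1,
)

for step in range(3):
    g = dist.new_group(ranks=[0])  # collective: all ranks must call it
    t = torch.ones(1)
    dist.all_reduce(t, group=g)
    dist.destroy_process_group(g)  # release the subgroup's resources

dist.destroy_process_group()  # tear down the default group last
```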