Update the process group in torch.distributed created using dist.init_process_group

Goal: distributed training with dynamic worker placement, where a worker's machine can change during training.

For example, consider a 4-worker parameter server setting. For the first 2 epochs, 2 of the workers run on Machine 1, but after 2 epochs they are supposed to run on Machine 2.

I am assuming that, since the workers' machine changes after 2 epochs, dist.init_process_group() needs to be called again. However, re-initializing raises this error:

RuntimeError: trying to initialize the default process group twice

What's the correct way to update the process group?

Solution Ideas:

  1. Is there any way to delete the initialized process group? Then, before re-initializing with dist.init_process_group(), I could delete the prior process group and avoid this error.

Hey @adarsh-kr,

There is a destroy_process_group API to clear the default ProcessGroup instance. If you would like to create multiple ProcessGroup instances, you can do so using the new_group API.
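A minimal sketch of how this could look, assuming an env-var rendezvous and that the new rank/world_size are known once the workers have moved; the helper name reinit_default_group is just for illustration:

```python
import torch.distributed as dist

def reinit_default_group(backend, init_method, world_size, rank):
    # Tear down the existing default process group (if any) so that
    # init_process_group() can be called again without hitting
    # "trying to initialize the default process group twice".
    if dist.is_initialized():
        dist.destroy_process_group()

    # Re-create the default group, e.g. after the workers have been
    # relocated to a different machine. init_method / rank / world_size
    # are placeholders for whatever your launcher provides.
    dist.init_process_group(backend=backend,
                            init_method=init_method,
                            world_size=world_size,
                            rank=rank)

# If you instead want additional groups alongside the default one,
# new_group creates a subgroup from a subset of ranks, e.g.:
#   subgroup = dist.new_group(ranks=[0, 1])
```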