Goal: distributed training with dynamic machine placement, where a worker’s device location can change during training.
For example, in a 4-worker parameter-server setting, 2 workers run on Machine 1 for the first 2 epochs, but after that they are supposed to run on Machine 2.
I am assuming that, since the workers’ machine changes after 2 epochs, dist.init_process_group() needs to be called again. However, re-initializing raises this error:
RuntimeError: trying to initialize the default process group twice
What’s the correct way to update the process group?
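For reference, this is roughly the pattern that triggers the error. It is a minimal single-process sketch; the gloo backend, rank/world_size values, and rendezvous address/port are placeholders, not the actual training setup.

```python
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# First initialization (e.g. for the first 2 epochs on Machine 1).
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Calling init_process_group again while the default group still exists fails:
# RuntimeError: trying to initialize the default process group twice
dist.init_process_group(backend="gloo", rank=0, world_size=1)
```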
Solution Ideas:
Is there any way to delete the initialized process group, so that before re-initializing with dist.init_process_group() I can delete the prior process group and avoid the issue?
There is a destroy_process_group API to clear the default ProcessGroup
instance. If you would like to create multiple ProcessGroup instances, you can do so using the new_group API.
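A minimal sketch of the destroy-then-reinitialize pattern, again assuming a single-process gloo group for illustration; the rank, world_size, and rendezvous settings are placeholders and would need to reflect the new worker-to-machine assignment in a real run.

```python
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Initial setup, e.g. for the first 2 epochs.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
# ... train for the first 2 epochs ...

# Tear down the default group, then re-initialize with the updated
# rank/world_size/rendezvous info for the new machine placement.
dist.destroy_process_group()
dist.init_process_group(backend="gloo", rank=0, world_size=1)
# ... continue training on the new machines ...
dist.destroy_process_group()
```

Alternatively, dist.new_group(ranks=[...]) creates additional ProcessGroup instances alongside the default one, which can then be passed explicitly to collectives via their group argument.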