Goal: distributed training with dynamic machine placement, where a worker’s device location can change during training.
For example, in a 4-worker parameter-server setting, 2 workers run on Machine 1 for the first 2 epochs, but after that they are supposed to run on Machine 2.
I am assuming that, since the workers’ machine changes after 2 epochs, dist.init_process_group() needs to be called again. However, re-initializing raises this error:
RuntimeError: trying to initialize the default process group twice
What’s the correct way to update the process group?
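For reference, this is roughly the pattern that triggers the error. It is a minimal single-process sketch; the gloo backend, rank/world_size values, and rendezvous address/port are placeholders, not the actual training setup.

```python
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# First initialization (e.g. for the first 2 epochs on Machine 1).
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Calling init_process_group again while the default group still exists fails:
# RuntimeError: trying to initialize the default process group twice
dist.init_process_group(backend="gloo", rank=0, world_size=1)
```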
Solution Ideas:
Is there any way to delete the initialized process group, so that before re-initializing with dist.init_process_group() I can delete the prior process group and avoid the issue?
There is a destroy_process_group API to clear the default ProcessGroup
instance. If you would like to create multiple ProcessGroup instances, you can do so using the new_group API.
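A minimal sketch of the destroy-then-reinitialize pattern, again assuming a single-process gloo group for illustration; the rank, world_size, and rendezvous settings are placeholders and would need to reflect the new worker-to-machine assignment in a real run.

```python
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Initial setup, e.g. for the first 2 epochs.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
# ... train for the first 2 epochs ...

# Tear down the default group, then re-initialize with the updated
# rank/world_size/rendezvous info for the new machine placement.
dist.destroy_process_group()
dist.init_process_group(backend="gloo", rank=0, world_size=1)
# ... continue training on the new machines ...
dist.destroy_process_group()
```

Alternatively, dist.new_group(ranks=[...]) creates additional ProcessGroup instances alongside the default one, which can then be passed explicitly to collectives via their group argument.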