Getting Subgroup Ranks Info in Custom Backend

esaliya · July 29, 2021, 3:01pm

Hi All,

Looking at how a subgroup is created with new_group(ranks, backend), I see it goes through _new_process_group_helper().

I am interested in routing this call to a custom backend. I see that the same method delegates to the registered process group creation method at L724.

However, it doesn’t seem to forward the participating ranks list to the method. Any help on how to get this information in the custom backend during new_group() creation?

Thanks,

H-Huang · July 30, 2021, 6:55pm

Perhaps I am misunderstanding the question, but the rank is passed into the method; the rank argument is there in the code snippet you sent. For a third-party backend you need to register it and call init_process_group on all the ranks (Distributed communication package - torch.distributed — PyTorch master documentation). Then you can use the collectives as usual.

esaliya · August 2, 2021, 1:38am

Thanks, yes, I am registering a third-party backend as in the document you’ve shared.

What I am referring to is the new_group() method in a ProcessGroup.

This allows creating a subgroup from the default process group by taking a list of ranks. This list is referred to as group_ranks in the _new_process_group_helper() method I pointed to above.

My question was that this group_ranks list is not passed to the custom backend method. Instead, only the rank and world_size are passed.
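
To make the gap concrete, here is a minimal pure-Python sketch of the dispatch as I understand it. It is a simplified model, not the actual PyTorch source: create_my_backend and new_group_sketch are hypothetical names, and the (store, rank, world_size, timeout) call signature is my assumption based on the behaviour described above.

```python
from datetime import timedelta

# Hypothetical stand-in for a custom backend's creation function.
# Assumption: the registered function is invoked with
# (store, rank, world_size, timeout), as described in this thread.
def create_my_backend(store, rank, world_size, timeout):
    # Only the group-local rank and the group size are visible here;
    # the global ranks that make up the subgroup are not.
    return {"rank": rank, "world_size": world_size}

# Simplified model of the subgroup dispatch, not the real implementation.
def new_group_sketch(global_rank, ranks, creator, store=None,
                     timeout=timedelta(minutes=30)):
    if global_rank not in ranks:
        return None  # this process is not part of the subgroup
    group_rank = ranks.index(global_rank)  # position within the subgroup
    group_size = len(ranks)
    # `ranks` is known at this point but is not forwarded to the creator.
    return creator(store, group_rank, group_size, timeout)

pg = new_group_sketch(global_rank=5, ranks=[1, 3, 5], creator=create_my_backend)
print(pg)
```

In this model the creator for global rank 5 only learns that it is rank 2 of 3; it has no way to tell that the other members are global ranks 1 and 3.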

Seems like a missing feature here.

esaliya · August 11, 2021, 7:46am

Confirmed, this is a missing feature. See: add c10d dynamic loading mechanism and unit test by ftian1 · Pull Request #28068 · pytorch/pytorch · GitHub