Do I need to implement my own collective communication primitive?

Hi, I’m wondering whether I should implement a custom collective primitive for my experiment, and if so, how.

In my experiment, each process belongs to two process groups, and each call to this primitive would run one all-gather in each group.
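
To make the question concrete, the unfused baseline could be sketched roughly like this. This is only a minimal sketch: `two_group_all_gather`, `group_a`, and `group_b` are placeholder names I made up, and it assumes the calling rank is a member of both groups and that every rank contributes a tensor of the same shape.

```python
import torch
import torch.distributed as dist


def two_group_all_gather(tensor, group_a, group_b):
    """Run one all-gather in each of two process groups.

    Hypothetical helper: the name and signature are illustrative only.
    Assumes the calling rank belongs to both groups.
    """
    # One receive buffer per rank in each group.
    out_a = [torch.empty_like(tensor) for _ in range(dist.get_world_size(group_a))]
    out_b = [torch.empty_like(tensor) for _ in range(dist.get_world_size(group_b))]

    # async_op=True returns a work handle instead of blocking, so the
    # second all-gather can be issued before the first one has finished.
    work_a = dist.all_gather(out_a, tensor, group=group_a, async_op=True)
    work_b = dist.all_gather(out_b, tensor, group=group_b, async_op=True)

    # Block until both collectives have completed.
    work_a.wait()
    work_b.wait()
    return out_a, out_b
```

Whether the two calls actually overlap on the wire depends on the backend, so this is the version I would want to benchmark against before writing a custom primitive.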

Would it be more communication-efficient to fuse these two all-gathers into a single new collective primitive?

If so, how can I implement this primitive? I noticed there is an example in test/cpp_extensions/cpp_c10d_extension.cpp, but I’m not familiar with how the c10d code is organized. Where should I start reading the torch.distributed source code in order to implement my own primitive?