Hi, I’m wondering whether I should implement a custom collective primitive for my experiment, and if so, how.
Basically, in my experiment, each process belongs to two process groups, and each call of this primitive would invoke one all-gather in each group.
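For concreteness, here is a minimal sketch of what one call looks like with the existing API. The names `dual_all_gather`, `group_a`, and `group_b` are hypothetical, and it assumes both groups were already created with `dist.new_group`. Whether the two collectives actually overlap depends on the backend.

```python
import torch
import torch.distributed as dist


def dual_all_gather(x_a, x_b, group_a, group_b):
    """Issue one all-gather per group and wait for both (hypothetical helper)."""
    # One output slot per rank in each group.
    out_a = [torch.empty_like(x_a) for _ in range(dist.get_world_size(group_a))]
    out_b = [torch.empty_like(x_b) for _ in range(dist.get_world_size(group_b))]

    # async_op=True returns a work handle instead of blocking, so both
    # collectives are issued before we wait on either of them.
    work_a = dist.all_gather(out_a, x_a, group=group_a, async_op=True)
    work_b = dist.all_gather(out_b, x_b, group=group_b, async_op=True)

    work_a.wait()
    work_b.wait()
    return out_a, out_b
```

This needs no C++ extension, so it could serve as a baseline to compare a fused primitive against.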
Would it be more communication-efficient if I fused these two all-gathers into a single new collective primitive?
If so, how can I implement this primitive? I notice that there is an example in test/cpp_extensions/cpp_c10d_extension.cpp, but the problem is that I know nothing about the c10d code organization. Where should I start to understand the source code of torch.distributed to implement my own primitive?