In P2P communication APIs like `torch.distributed.send(tensor, dst, group=None, tag=0)`, the `dst` argument has to be the rank in the global process group, no matter which `group` is passed.
Suppose two ranks belong to a subgroup that is smaller than the global group. Is there any performance difference between passing the global group to these APIs and passing the smaller subgroup?
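For concreteness, here is a minimal sketch of the two variants I mean (assuming 4 processes launched with `torchrun`, the Gloo backend, and a subgroup made of ranks 0 and 1; the exact backend and world size are just for illustration):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

# new_group must be called by all ranks, even those not in the subgroup.
subgroup = dist.new_group(ranks=[0, 1])

tensor = torch.ones(10)

if rank == 0:
    # Variant A: default (global) process group.
    dist.send(tensor, dst=1)
    # Variant B: explicit subgroup; dst is still the global rank 1.
    dist.send(tensor, dst=1, group=subgroup)
elif rank == 1:
    dist.recv(tensor, src=0)
    dist.recv(tensor, src=0, group=subgroup)

dist.destroy_process_group()
```

Both variants move the same tensor between the same two processes, so I am wondering whether choosing one group over the other changes anything beyond bookkeeping.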