Performance impact of the `group` argument in P2P communication

In P2P communication APIs such as `torch.distributed.send(tensor, dst, group=None, tag=0)`, the `dst` rank must be a rank in the global process group, regardless of which `group` is passed.
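To make the rank convention concrete, here is a minimal sketch of the bookkeeping it implies. It assumes a subgroup built from global ranks 2 and 5 (for example via `dist.new_group(ranks=[2, 5])`); the helpers `to_global_rank` and `to_group_rank` are hypothetical names used for illustration and are not part of `torch.distributed`.

```python
# Hypothetical helpers for illustration only; not part of torch.distributed.

def to_global_rank(subgroup_ranks, group_rank):
    """Translate a rank local to the subgroup into its global rank."""
    return subgroup_ranks[group_rank]


def to_group_rank(subgroup_ranks, global_rank):
    """Translate a global rank into its position inside the subgroup."""
    return subgroup_ranks.index(global_rank)


# Assumed example: the subgroup contains global ranks 2 and 5.
subgroup_ranks = [2, 5]

# The second member of the subgroup has group-local rank 1, but the value
# passed as `dst` must be its global rank.
dst = to_global_rank(subgroup_ranks, 1)

# The call would then look like this, whichever group is used:
#   dist.send(tensor, dst=dst, group=subgroup)
#   dist.send(tensor, dst=dst)  # default (global) group
```

Either way the `dst` value is the same; only the `group` argument differs between the two calls.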

Suppose two ranks both belong to a subgroup that is smaller than the global group. Is there any performance difference between passing the global group to these APIs and passing the smaller subgroup?