For the Gloo backend in PyTorch distributed, as described in this document Distributed communication package - torch.distributed — PyTorch 1.9.1 documentation, will the following code get the performance benefits of CUDA-aware MPI (e.g., GPU-to-GPU transfers over PCIe that bypass the CPU)?
import torch.distributed as dist

group = dist.new_group([0, 1], backend="gloo")
dist.all_reduce(gpu_tensor_a, op=dist.ReduceOp.SUM, group=group)
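For context, here is a fuller minimal sketch of the setup I have in mind. It is hypothetical (the address, port, tensor contents, and the assumption of two ranks on one node with one GPU each are mine, not from the docs), but it shows where the CUDA tensor and the Gloo group come from in the snippet above:

# Hypothetical minimal setup: two ranks on one node, one GPU per rank,
# all_reduce on a CUDA tensor through the Gloo backend.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-node run
    os.environ["MASTER_PORT"] = "29500"       # arbitrary free port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # The tensor already lives on this rank's GPU when all_reduce is called.
    gpu_tensor_a = torch.ones(4, device=f"cuda:{rank}") * (rank + 1)

    group = dist.new_group([0, 1], backend="gloo")
    dist.all_reduce(gpu_tensor_a, op=dist.ReduceOp.SUM, group=group)

    print(f"rank {rank}: {gpu_tensor_a}")     # expect [3., 3., 3., 3.] on both ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

Concretely, the question is whether the data path for this all_reduce stays on the GPUs (device-to-device over PCIe), or whether Gloo stages the tensor through host memory before communicating.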