According to the NCCL documentation, point-to-point communication can be achieved with ncclSend and ncclRecv since NCCL 2.7. However, the newest stable version of PyTorch still doesn't support send and recv with the NCCL backend. I'm wondering: is there any way to achieve point-to-point communication between GPUs in PyTorch? And is there any way to integrate ncclSend and ncclRecv into PyTorch distributed?
For now, you can work around it by creating a subgroup containing just the two ranks and then calling dist.broadcast(tensor, src, group=sub_group) to mimic a P2P send/recv. PipeDream already uses this approach.
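Here is a minimal sketch of that workaround, with some assumptions: two processes (world_size=2) with one GPU each, launched with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT set in the environment (e.g. by torch.distributed.launch). The payload value and tensor shape are just for illustration.

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Every rank must call new_group() with identical arguments, even
    # ranks outside the group, so all processes agree on its creation.
    pair = dist.new_group(ranks=[0, 1])

    tensor = torch.zeros(4, device="cuda")
    if rank == 0:
        tensor += 42.0  # illustrative payload to "send"

    # Broadcasting within the two-rank group from src=0 delivers the
    # tensor to rank 1, mimicking an ncclSend/ncclRecv pair.
    dist.broadcast(tensor, src=0, group=pair)

    print(f"rank {rank}: {tensor}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Since broadcast is a collective, both the "sender" and the "receiver" make the same dist.broadcast call; the src argument determines which rank's buffer is copied into the other's.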