According to the NCCL documentation, since NCCL 2.7 point-to-point communication can be achieved using ncclSend and ncclRecv. However, in PyTorch the newest stable version still doesn't support send and recv when using NCCL as the backend. I'm wondering: is there any way to achieve point-to-point communication between GPUs in PyTorch? And is there any way to integrate ncclSend and ncclRecv into PyTorch distributed?
Hey @Yi_Zhang, we are working on adding P2P support to the NCCL ProcessGroup backend. We just bumped the NCCL submodule version in https://github.com/pytorch/pytorch/pull/41608.
For now, as a workaround, you can create a sub-group of 2 ranks and then use
dist.broadcast(tensor, src, group=sub_group) to mimic P2P send/recv. PipeDream already uses this approach.
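A minimal sketch of that workaround is below. The master address/port values are arbitrary placeholders, and it uses the gloo backend on CPU so it runs without GPUs; the same structure applies with backend="nccl" on GPU tensors.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Single-machine rendezvous for illustration; address/port are arbitrary.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo backend on CPU so the sketch runs anywhere; with GPUs you
    # would pass backend="nccl" instead.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # new_group must be called by ALL ranks, even those not in the group.
    sub_group = dist.new_group(ranks=[0, 1])

    tensor = torch.tensor([1.0, 2.0, 3.0]) if rank == 0 else torch.zeros(3)

    # Within the 2-rank group, broadcast from rank 0 behaves like
    # send on rank 0 and recv on rank 1.
    dist.broadcast(tensor, src=0, group=sub_group)

    if rank == 1:
        assert tensor.tolist() == [1.0, 2.0, 3.0]

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
    P2P_OK = True
```

Note that every process must make the same dist.new_group calls in the same order, even processes that don't belong to the group.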
If you need general P2P support, you could try the RPC API. One caveat: we are still working on improving support for GPU tensors. https://github.com/pytorch/pytorch/issues/41369
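For CPU tensors, the RPC route can look like the sketch below (worker names and the port are illustrative; with the current limitation noted above, tensors should be moved to CPU before being sent).

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    # Arbitrary single-machine rendezvous values for illustration.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # Ship a CPU tensor to worker1, run torch.add there,
        # and get the result back synchronously.
        result = rpc.rpc_sync("worker1", torch.add,
                              args=(torch.tensor([1.0, 2.0]), 1))
        assert result.tolist() == [2.0, 3.0]

    # shutdown() blocks until all outstanding RPCs have completed.
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
    RPC_OK = True
```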