How to do simultaneous isend/irecv

Is there a way to have ranks 0 and 1 send a message to each other at the same time and then receive the messages that were sent, without causing a deadlock?

I’ve tried the following code:

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called on both ranks
rank = dist.get_rank()
send_tensor = torch.zeros(1).cuda()
recv_tensor = torch.zeros(1).cuda()

reqs = []
if rank == 0:
    neighbour = 1
if rank == 1:
    neighbour = 0

# post the non-blocking send and recv, then wait on both
reqs.append(dist.isend(tensor=send_tensor, dst=neighbour))
reqs.append(dist.irecv(tensor=recv_tensor, src=neighbour))
for req in reqs:
    req.wait()

But as you can imagine, the first req in the reqs list is the send request on both processes, and each ends up waiting for the other one to receive, resulting in a deadlock.

Of course you can do send->recv on rank 0 and recv->send on rank 1. But this would take twice the amount of time.
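A minimal sketch of that ordered version, assuming the same two-rank setup and the same send_tensor/recv_tensor as above (blocking send/recv, with the order flipped on one rank so the calls pair up):

# rank 0 sends first and then receives; rank 1 receives first and then sends,
# so the two sides never block on each other's send at the same time
if rank == 0:
    dist.send(tensor=send_tensor, dst=1)
    dist.recv(tensor=recv_tensor, src=1)
else:
    dist.recv(tensor=recv_tensor, src=0)
    dist.send(tensor=send_tensor, dst=0)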

In MPI there is a Waitall function, so the requests don’t have to be completed in a predetermined order and the deadlock is avoided. But there doesn’t seem to be one in PyTorch.
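For comparison, this is the kind of pattern I mean in MPI, e.g. with mpi4py (just a sketch, assuming a two-rank job and NumPy buffers):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
neighbour = 1 - rank  # the other rank in a two-process job

send_buf = np.zeros(1)
recv_buf = np.empty(1)

# post both non-blocking requests, then wait on them together;
# Waitall lets them complete in whatever order the runtime allows
reqs = [comm.Isend(send_buf, dest=neighbour),
        comm.Irecv(recv_buf, source=neighbour)]
MPI.Request.Waitall(reqs)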

This seems like a very simple thing to do, but I couldn’t figure out how. Any help would be appreciated.

Thank you thank you!

Have you taken a look at e.g., batch_isend_irecv?

Thanks for the reply!

I did some benchmarking on my machine. It seems like doing the following:

send_tensor = torch.zeros(1).cuda()
recv_tensor = torch.zeros(1).cuda()

if rank == 0:
    sendOp = dist.P2POp(dist.isend, send_tensor, 1)
    recvOp = dist.P2POp(dist.irecv, recv_tensor, 1)
    reqs = dist.batch_isend_irecv([sendOp, recvOp])
    for req in reqs:
        req.wait()
elif rank == 1:
    sendOp = dist.P2POp(dist.isend, send_tensor, 0)
    recvOp = dist.P2POp(dist.irecv, recv_tensor, 0)
    reqs = dist.batch_isend_irecv([sendOp, recvOp])
    for req in reqs:
        req.wait()

takes twice the amount of time as doing a single isend, as in the following:

reqs = []
if rank == 0:
    reqs.append(dist.isend(tensor=send_tensor, dst=1))
else:
    reqs.append(dist.irecv(tensor=recv_tensor, src=0))
for req in reqs:
    req.wait()
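A simple way to time either variant is a loop along these lines (just a sketch; it assumes the CUDA tensors and ops from above, hence the torch.cuda.synchronize() calls around the measurement):

import time

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    reqs = dist.batch_isend_irecv([sendOp, recvOp])  # or the single isend/irecv version
    for req in reqs:
        req.wait()
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 100
print(f"rank {rank}: {elapsed * 1e6:.1f} us per exchange")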

Looking at this GitHub issue, it does seem like batch_isend_irecv() is supposed to support concurrent send/recv, so I don’t know what I’m missing here.