How to do simultaneous isend/irecv

Is there a way to have ranks 0 and 1 send a message to each other at the same time and then receive the messages that were sent, without causing a deadlock?

I’ve tried the following code:

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called on both ranks
rank = dist.get_rank()
send_tensor = torch.zeros(1).cuda()
recv_tensor = torch.zeros(1).cuda()

reqs = []
if rank == 0:
    neighbour = 1
if rank == 1:
    neighbour = 0

# post the non-blocking send and recv, then wait on both
reqs.append(dist.isend(tensor=send_tensor, dst=neighbour))
reqs.append(dist.irecv(tensor=recv_tensor, src=neighbour))
for req in reqs:
    req.wait()

But as you can imagine, the first req in the reqs list is the send request on both processes, and each ends up waiting for the other one to receive, resulting in a deadlock.

Of course you can do send->recv on rank 0 and recv->send on rank 1. But this would take twice the amount of time.
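A minimal sketch of that ordered version, assuming the same two-rank setup and the same send_tensor/recv_tensor as above (blocking send/recv, with the order flipped on one rank so the calls pair up):

# rank 0 sends first and then receives; rank 1 receives first and then sends,
# so the two sides never block on each other's send at the same time
if rank == 0:
    dist.send(tensor=send_tensor, dst=1)
    dist.recv(tensor=recv_tensor, src=1)
else:
    dist.recv(tensor=recv_tensor, src=0)
    dist.send(tensor=send_tensor, dst=0)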

In MPI there is a Waitall function, so the requests don’t have to be completed in a predetermined order and the deadlock is avoided. But there doesn’t seem to be one in PyTorch.
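For comparison, this is the kind of pattern I mean in MPI, e.g. with mpi4py (just a sketch, assuming a two-rank job and NumPy buffers):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
neighbour = 1 - rank  # the other rank in a two-process job

send_buf = np.zeros(1)
recv_buf = np.empty(1)

# post both non-blocking requests, then wait on them together;
# Waitall lets them complete in whatever order the runtime allows
reqs = [comm.Isend(send_buf, dest=neighbour),
        comm.Irecv(recv_buf, source=neighbour)]
MPI.Request.Waitall(reqs)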

This seems like a very simple thing to do, but I couldn’t figure out how. Any help would be appreciated.

Thank you thank you!

Have you taken a look at e.g., batch_isend_irecv?

Thanks for the reply!

I did some benchmarking on my machine. It seems like doing the following:

send_tensor = torch.zeros(1).cuda()
recv_tensor = torch.zeros(1).cuda()

if rank == 0:
    sendOp = dist.P2POp(dist.isend, send_tensor, 1)
    recvOp = dist.P2POp(dist.irecv, recv_tensor, 1)
    reqs = dist.batch_isend_irecv([sendOp, recvOp])
    for req in reqs:
        req.wait()
elif rank == 1:
    sendOp = dist.P2POp(dist.isend, send_tensor, 0)
    recvOp = dist.P2POp(dist.irecv, recv_tensor, 0)
    reqs = dist.batch_isend_irecv([sendOp, recvOp])
    for req in reqs:
        req.wait()

takes twice the amount of time as doing a single isend, as in the following:

reqs = []
if rank == 0:
    reqs.append(dist.isend(tensor=send_tensor, dst=1))
else:
    reqs.append(dist.irecv(tensor=recv_tensor, src=0))
for req in reqs:
    req.wait()
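A simple way to time either variant is a loop along these lines (just a sketch; it assumes the CUDA tensors and ops from above, hence the torch.cuda.synchronize() calls around the measurement):

import time

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    reqs = dist.batch_isend_irecv([sendOp, recvOp])  # or the single isend/irecv version
    for req in reqs:
        req.wait()
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 100
print(f"rank {rank}: {elapsed * 1e6:.1f} us per exchange")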

Looking at this GitHub issue, it does seem like batch_isend_irecv() is supposed to support concurrent send/recv, so I don’t know what I’m missing here.