Two ranks exchange data with send and recv. Rank 0 groups the two ops in a single batch_isend_irecv call, while rank 1 issues them as two separate calls. Is the behaviour undefined?
import torch
import torch.distributed as dist

# assumes dist.init_process_group("nccl") has been called and each rank
# has selected its own GPU (e.g. torch.cuda.set_device(rank))
rank = dist.get_rank()

send_tensor = torch.arange(2, dtype=torch.float32, device='cuda') + 2 * rank
recv_tensor = torch.randn(2, dtype=torch.float32, device='cuda')

if rank == 0:
    # rank 0: submit send and recv in one batched call
    send_op = dist.P2POp(dist.isend, send_tensor, 1)
    recv_op = dist.P2POp(dist.irecv, recv_tensor, 1)
    reqs = dist.batch_isend_irecv([send_op, recv_op])
    for req in reqs:
        req.wait()
else:
    # rank 1: submit send and recv as two separate batched calls
    send_op = dist.P2POp(dist.isend, send_tensor, 0)
    recv_op = dist.P2POp(dist.irecv, recv_tensor, 0)
    reqs = dist.batch_isend_irecv([send_op])
    reqs += dist.batch_isend_irecv([recv_op])
    for req in reqs:
        req.wait()
I got an ncclInternalError. Is this behaviour defined by NCCL?
Traceback (most recent call last):
  File "test_comm.py", line 30, in <module>
    reqs = dist.batch_isend_irecv([send_op])
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1865, in batch_isend_irecv
    p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1810, in _coalescing_manager
    work = group._end_coalescing(device)
torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3608, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
ncclInternalError: Internal check failed.
Last error:
Message truncated : received 4096 bytes instead of 2048
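
For comparison, here is a sketch of the fully batched variant I would use if both ranks grouped the two ops the same way (my own sketch, along the lines of the batch_isend_irecv example in the PyTorch docs; the peer computation 1 - rank assumes exactly two ranks):

peer = 1 - rank  # the other rank in a two-rank job
send_op = dist.P2POp(dist.isend, send_tensor, peer)
recv_op = dist.P2POp(dist.irecv, recv_tensor, peer)
# both ranks submit their send and recv in a single batched call
reqs = dist.batch_isend_irecv([send_op, recv_op])
for req in reqs:
    req.wait()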