Isend and irecv happen on different streams with NCCL

Hi, I’m using torch.distributed.isend and torch.distributed.irecv for asynchronous communication with the NCCL backend.

I assumed that isend and irecv would run on the same CUDA stream when no NCCL process group is specified:

for _ in range(16):
    work = torch.distributed.irecv(input, prev_rank)   # async receive from the previous rank
    ...
    work.wait()                                        # block until the receive has completed
    output = sub_model(input)                          # compute on the received tensor
    torch.distributed.isend(output, next_rank)         # async send to the next rank
    ...
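
For context, a slightly fuller, self-contained sketch of the same loop is shown below (sub_model, input, prev_rank and next_rank are the same placeholder names as above); here the isend work handle is also kept and waited on before the next send is issued, although that does not change the stream behavior I describe next:

import torch.distributed as dist

# Fuller sketch of the loop above; sub_model, input, prev_rank and next_rank
# are placeholders. The isend work handle is kept so the output buffer is not
# reused before the previous send has finished.
send_work = None
for _ in range(16):
    recv_work = dist.irecv(input, prev_rank)    # async receive from the previous rank
    recv_work.wait()                            # current stream waits until the recv is done
    output = sub_model(input)                   # compute on the default stream
    if send_work is not None:
        send_work.wait()                        # wait for the previous send before issuing a new one
    send_work = dist.isend(output, next_rank)   # async send to the next rank
if send_work is not None:
    send_work.wait()                            # drain the last outstanding send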

But I found that they actually run on two different streams; the nsys profiling results look like this:

Note that this is from one of the devices: the default stream 7 is performing the computation, and the red circles mark the send and recv kernels, which run on separate streams. Is this expected?
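
To make the phases easier to pick out in the nsys timeline, the loop can be annotated with NVTX ranges, e.g. (a rough sketch using the same placeholder names as above):

import torch
import torch.distributed as dist

for step in range(16):
    torch.cuda.nvtx.range_push(f"recv_{step}")      # mark the receive phase
    work = dist.irecv(input, prev_rank)
    work.wait()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push(f"compute_{step}")   # mark the forward computation
    output = sub_model(input)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push(f"send_{step}")      # mark the send phase
    dist.isend(output, next_rank)
    torch.cuda.nvtx.range_pop()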

Docker image: nvcr.io/nvidia/pytorch:21.10-py3
PyTorch version: 1.10.0a0+0aef44c