Unexpected Behavior with torch.distributed.isend and irecv in Asynchronous Communication

I am trying to use asynchronous send (isend) and receive (irecv) for non-blocking communication in a PyTorch distributed setup. My understanding is that torch.distributed.isend should return immediately, allowing execution to continue without blocking. However, the behavior I observe does not match this expectation.
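
For context, this is the kind of non-blocking pattern I expected isend to support on the sending rank (a minimal sketch: process-group setup is omitted, t is a CUDA tensor, and other_compute() is just a placeholder for work to overlap with the transfer):

req = torch.distributed.isend(tensor=t, dst=1)  # expected to return immediately with a Work handle
other_compute()                                 # overlapped computation while the transfer is in flight
req.wait()                                      # block only here, once completion is actually needed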

Steps to Reproduce:

  1. Initialize the distributed environment with the NCCL backend and CUDA tensors for GPU communication (the script is launched with torchrun, as shown after this list).
  2. Use torch.distributed.isend and irecv for sending and receiving tensors between processes.
  3. Introduce a time.sleep(20) delay in the receiving process before it posts irecv, so that a truly non-blocking isend would show up as a roughly 20-second gap between the logged timestamps.
  4. Observe the timestamps logged after the send and receive operations.
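
The script (asyncSendRecv.py, listed under Actual Behavior below) is launched on a single node with two GPUs:

torchrun --nproc_per_node=2 asyncSendRecv.py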

Expected Behavior:

The expectation is that the sending process (process 0) would log its timestamp immediately after initiating the send, roughly 20 seconds earlier than the receiving process (process 1), which is delayed by the time.sleep(20).
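
Concretely, I expected the two log lines to look roughly like this (placeholders, not real timestamps):

0 finishes  <t_send>           (logged right after isend returns)
1 finishes  <t_send + ~20 s>   (logged after the 20-second sleep)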

Actual Behavior:

Both processes log their timestamps almost simultaneously, indicating that the send and receive block until both sides have posted their operation, contrary to the expected asynchronous behavior. This suggests that isend is not returning immediately, or that the two operations are somehow synchronized before the timestamps are taken.
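
To check whether this is specific to the NCCL point-to-point path, one comparison would be the same timing experiment with CPU tensors on the gloo backend. This is an untested sketch, included only to clarify the question; MASTER_ADDR/MASTER_PORT are assumed to be supplied by torchrun:

# Untested sketch: the same timing experiment on CPU tensors with the gloo
# backend, to see whether isend returns immediately there.
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # torchrun supplies MASTER_ADDR/MASTER_PORT
rank = dist.get_rank()
t = torch.randn(16, 256, 128) if rank == 0 else torch.zeros(16, 256, 128)

if rank == 0:
    req = dist.isend(tensor=t, dst=1)
    timestamp = time.time()
else:
    time.sleep(20)
    req = dist.irecv(tensor=t, src=0)
    timestamp = time.time()

print(f"{rank} finishes ", timestamp)
req.wait()  # ensure the transfer completes before the process exits

My actual reproduction script (asyncSendRecv.py) is: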

import torch
import os
import time

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'

torch.distributed.init_process_group(backend="nccl")
# Pin each rank to its own GPU (single node; world_size equals the number of GPUs here).
torch.cuda.set_device(torch.distributed.get_rank() % torch.distributed.get_world_size())

# Rank 0 holds the data to send; rank 1 holds a zero-filled buffer to receive into.
if torch.distributed.get_rank() == 0:
    t = torch.randn(16, 256, 128).cuda()
else:
    t = torch.zeros(16, 256, 128).cuda()

if torch.distributed.get_rank() == 0:
    # Expectation: isend returns immediately with a Work handle.
    req = torch.distributed.isend(tensor=t, dst=1)
    timestamp = time.time()
else:
    # Delay the receive by 20 s, so a truly non-blocking isend would show up
    # as a ~20-second gap between the two logged timestamps.
    time.sleep(20)
    req = torch.distributed.irecv(tensor=t, src=0)
    timestamp = time.time()

print(f"{torch.distributed.get_rank()} finishes ", timestamp)

Output:

torchrun --nproc_per_node=2 asyncSendRecv.py
[2024-03-25 14:43:13,341] torch.distributed.run: [WARNING] 
[2024-03-25 14:43:13,341] torch.distributed.run: [WARNING] *****************************************
[2024-03-25 14:43:13,341] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-25 14:43:13,341] torch.distributed.run: [WARNING] *****************************************
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
1 finishes  1711349017.1358058
0 finishes  1711349017.1358182