Processes get blocked despite using async all-reduce

Hi there, I am trying to use asynchronous all-reduce in torch.distributed, which is introduced in the PyTorch docs. However, I found the processes still get blocked even though I set async_op=True. Can someone tell me where I went wrong? :thinking:

I copied the example code provided by the docs, adding some sleep and print commands to check whether it blocks.

import torch
import torch.distributed as dist
import os
import time

rank = int(os.getenv('RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '2'))
dist.init_process_group('nccl', rank=rank, world_size=world_size)

output = torch.tensor([rank]).cuda(rank)
if rank == 1:
    # delay rank 1 so rank 0 can run ahead if the call is truly async
    time.sleep(5)

s = torch.cuda.Stream()
print(f"Process {rank}: begin async all-reduce", flush=True)
handle = dist.all_reduce(output, async_op=True)
# Wait ensures the operation is enqueued, but not necessarily complete.
handle.wait()
print(f"Process {rank}: async check", flush=True)
# Using result on non-default stream.
with torch.cuda.stream(s):
    s.wait_stream(torch.cuda.default_stream())
    output.add_(100)
if rank == 0:
    # if the explicit call to wait_stream was omitted, the output below will be
    # non-deterministically 1 or 101, depending on whether the allreduce overwrote
    # the value after the add completed.
    print(output)

Process 0: begin async all-reduce
Process 1: begin async all-reduce
Process 1: async check
Process 0: async check
tensor([101], device='cuda:0')

I expect 'Process 0: async check' to be printed before 'Process 1: begin async all-reduce'. Where did I go wrong?

P.S. It seems to be an NCCL problem, because I got the expected output when using gloo.

I just found that I had set CUDA_LAUNCH_BLOCKING=1 …
Removing that setting makes everything work as expected.
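For anyone hitting the same issue: CUDA_LAUNCH_BLOCKING=1 forces every CUDA kernel launch to run synchronously, which serializes the NCCL collective and defeats async_op=True. A minimal sketch of a startup guard (check_async_friendly is a hypothetical helper, not part of torch; the env var must be unset before CUDA is initialized for this to take effect):

```python
import os

def check_async_friendly():
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so
    # "async" collectives will block the CPU until they complete.
    return os.environ.get("CUDA_LAUNCH_BLOCKING", "0") != "1"

# Simulate the misconfiguration from this post:
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
print(check_async_friendly())  # False

# Remove it (must happen before CUDA initialization to matter):
os.environ.pop("CUDA_LAUNCH_BLOCKING", None)
print(check_async_friendly())  # True
```

Running the guard before dist.init_process_group() is a cheap way to catch this misconfiguration early instead of debugging mysterious blocking later.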