Processes get blocked despite using async all-reduce

Hi there, I am trying to use the asynchronous all-reduce in torch.distributed that is introduced in the PyTorch docs. However, I found that the processes still get blocked even though I set async_op=True. Can someone tell me where I went wrong? :thinking:

I copied the example code from the docs and added some sleep and print calls to check whether the call blocks.

import torch
import torch.distributed as dist
import os
import time

rank = int(os.getenv('RANK', '0'))
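# The default init_method is env://, so MASTER_ADDR and MASTER_PORT must be
# set in the environment of each process before this script is run.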
dist.init_process_group(
        backend='nccl',
        world_size=2,
        rank=rank,
        )

output = torch.tensor([rank]).cuda(rank)
if rank == 1:
    time.sleep(5)

s = torch.cuda.Stream()
print(f"Process {rank}: begin aysnc all-reduce", flush=True)
handle = dist.all_reduce(output, async_op=True)
# Wait ensures the operation is enqueued, but not necessarily complete.
handle.wait()
# Using result on non-default stream.
print(f"Process {rank}: async check")
with torch.cuda.stream(s):
    s.wait_stream(torch.cuda.default_stream())
    output.add_(100)
if rank == 0:
    # if the explicit call to wait_stream was omitted, the output below will be
    # non-deterministically 1 or 101, depending on whether the allreduce overwrote
    # the value after the add completed.
    print(output)

output:
Process 0: begin async all-reduce
Process 1: begin async all-reduce
Process 1: async check
Process 0: async check
tensor([101], device='cuda:0')

I expected 'Process 0: async check' to be printed before 'Process 1: begin async all-reduce'. Where did I go wrong?
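
(For reference, the work handle returned with async_op=True can also be polled without blocking, which makes it easier to see whether the launch was really asynchronous; a small sketch, reusing the same output tensor as above:)

handle = dist.all_reduce(output, async_op=True)
# is_completed() does not block; with a truly asynchronous launch, rank 0 can
# reach this line (and print False) while rank 1 is still sleeping.
print(f"Process {rank}: completed = {handle.is_completed()}", flush=True)
handle.wait()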

P.S. It seems to be an NCCL problem, because I got the expected output when using the gloo backend.
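
(Roughly what the gloo comparison looked like: the same flow, just a different backend and a CPU tensor.)

# Rough sketch of the gloo variant: same flow, CPU tensor, backend='gloo'.
import os, time
import torch
import torch.distributed as dist

rank = int(os.getenv('RANK', '0'))
dist.init_process_group(backend='gloo', world_size=2, rank=rank)

output = torch.tensor([rank])
if rank == 1:
    time.sleep(5)

handle = dist.all_reduce(output, async_op=True)  # returns immediately with gloo
print(f"Process {rank}: async check", flush=True)
handle.wait()  # blocks until the reduction has actually finished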

I just found that I had set CUDA_LAUNCH_BLOCKING=1 …
After deleting that line, everything works as expected.
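
For anyone who hits the same thing: CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous, so the NCCL all-reduce blocks at launch time until the collective can complete, and async_op=True never gets a chance to return early. A quick sketch of making sure it is off (this has to happen before the first CUDA call in the process; simply not exporting it in the shell works just as well):

import os

# CUDA_LAUNCH_BLOCKING is read when the CUDA context is created, so remove it
# (or set it to '0') before any CUDA work, otherwise kernel launches block.
os.environ.pop('CUDA_LAUNCH_BLOCKING', None)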