Hi there, I am trying to use asynchronous all-reduce in torch.distributed, which is introduced in the PyTorch docs. However, I found that the processes still block even though I set async_op=True. Can someone tell me where I went wrong?
I copied the example code provided in the docs, adding some sleep and print calls to check whether it blocks.
import torch
import torch.distributed as dist
import os
import time

rank = int(os.getenv('RANK', '0'))
dist.init_process_group(
    backend='nccl',
    world_size=2,
    rank=rank,
)

output = torch.tensor([rank]).cuda(rank)
if rank == 1:
    time.sleep(5)

s = torch.cuda.Stream()
print(f"Process {rank}: begin async all-reduce", flush=True)
handle = dist.all_reduce(output, async_op=True)
# Wait ensures the operation is enqueued, but not necessarily complete.
handle.wait()
# Using result on non-default stream.
print(f"Process {rank}: async check", flush=True)
with torch.cuda.stream(s):
    s.wait_stream(torch.cuda.default_stream())
    output.add_(100)
    if rank == 0:
        # if the explicit call to wait_stream was omitted, the output below will be
        # non-deterministically 1 or 101, depending on whether the allreduce overwrote
        # the value after the add completed.
        print(output)
Output:
Process 0: begin async all-reduce
Process 1: begin async all-reduce
Process 1: async check
Process 0: async check
tensor([101], device='cuda:0')
I expected 'Process 0: async check' to be printed before 'Process 1: begin async all-reduce'. Where did I go wrong?
p.s. It seems to be an NCCL problem, because I got the expected output when using the gloo backend.