Here is a simple example:

```python
import os

import torch
import torch.distributed as dist

def main():
    # Initialize the NCCL backend
    dist.init_process_group(backend='nccl')
    world_size = int(os.environ['WORLD_SIZE'])
    rank = int(os.environ['RANK'])

    # Create a tensor on the GPU
    tensor = torch.rand(10).cuda(rank)

    # Start CUDA graph capture
    stream = torch.cuda.Stream(device=f"cuda:{rank}")
    graph = torch.cuda.CUDAGraph()
    stream.synchronize()
    with torch.cuda.graph(graph, stream=stream):
        # Perform all-reduce operation
        dist.all_reduce(tensor)
    stream.synchronize()

    # Execute the graph
    graph.replay()
    print("All-reduce completed:", tensor)

if __name__ == "__main__":
    main()
```
Running it with `torchrun --nproc-per-node 8 test.py` fails with the following error:

```
torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not permitted when stream is capturing'
```
Per the NCCL documentation:

> Starting with NCCL 2.9, NCCL operations can be captured by CUDA Graphs.
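For comparison, the pattern PyTorch's CUDA Graphs documentation describes for capturing collectives runs a few warmup iterations on a side stream before capture, so that NCCL's lazy communicator initialization (which issues CUDA calls that are forbidden while a stream is capturing) happens eagerly rather than inside the graph. A rough, untested sketch under the same setup as above (the warmup count of 3 is arbitrary):

```python
import os

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)

    tensor = torch.rand(10, device=f"cuda:{rank}")

    # Warmup on a side stream: the first all_reduce lazily creates the
    # NCCL communicator, so running it here keeps that initialization
    # outside of graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            dist.all_reduce(tensor)
    torch.cuda.current_stream().wait_stream(s)

    # Capture only the collective itself; the communicator already exists.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        dist.all_reduce(tensor)

    graph.replay()
    torch.cuda.synchronize()
    print("All-reduce completed:", tensor)

if __name__ == "__main__":
    main()
```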
So why does PyTorch's CUDA graph capture fail on the all-reduce operation?