Is cuda_graph compatible with DistributedDataParallel

I tested the following code; the model is a DistributedDataParallel module:

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        train_step(model, inputs, targets, optimizer, scaler, update=True)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    train_step(model, inputs, targets, optimizer, scaler, update=True)

It reports:

File "/mnt/lustressd/luojiapeng/projects/deepspeed_tests/tools/", line 74, in train_step
    outputs = model(inputs)
File "/mnt/cache/luojiapeng/miniconda3/envs/torch1.12/lib/python3.10/site-packages/torch/nn/modules/", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
File "/mnt/cache/luojiapeng/miniconda3/envs/torch1.12/lib/python3.10/site-packages/torch/nn/parallel/", line 976, in forward
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Yes, CUDA Graphs are compatible with DDP, as explained here. As the docs explain, you need to run at least 11 warmup iterations in DDP-enabled eager mode before capture, so that DDP can finish its lazy initialization outside the captured region.
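For reference, the warmup-then-capture pattern could be sketched roughly as below (a minimal sketch, not the questioner's actual setup: `capture_ddp_train_step` and the inner `train_step` closure are hypothetical names, the loss function is an arbitrary choice for illustration, and the optimizer must itself be capture-safe for `optimizer.step()` to be graphed):

```python
import torch
import torch.nn as nn

def capture_ddp_train_step(model, inputs, targets, optimizer, warmup_iters=11):
    """Sketch: warm up a DDP model in eager mode, then capture one training
    step in a CUDA graph.

    DDP does lazy initialization during its first iterations, so the docs
    require at least 11 DDP-enabled eager iterations before capture.
    """
    loss_fn = nn.MSELoss()  # assumption: simple loss for illustration

    def train_step():
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # Warm up on a side stream so warmup work stays out of the capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(warmup_iters):
            train_step()
    torch.cuda.current_stream().wait_stream(s)

    # Capture one full step; inputs/targets must be static tensors that
    # you overwrite in place before each replay.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        train_step()
    return g  # g.replay() re-runs the captured step on the static tensors
```

After capture, each training iteration becomes: copy new data into `inputs`/`targets` in place, then call `g.replay()`.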

Thanks for the quick reply. This information is very helpful. I'm still working on it, but haven't been successful yet.