Is cuda_graph compatible with DistributedDataParallel?

I tested the code below; the model is a DistributedDataParallel module:

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        train_step(model, inputs, targets, optimizer, scaler, update=True)
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step into a CUDA graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    train_step(model, inputs, targets, optimizer, scaler, update=True)

It reports "
File “/mnt/lustressd/luojiapeng/projects/deepspeed_tests/tools/run_ddp.py”, line 74, in train_step
outputs = model(inputs)
File “/mnt/cache/luojiapeng/miniconda3/envs/torch1.12/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1130, in _call_impl
return forward_call(*input, **kwargs)
File “/mnt/cache/luojiapeng/miniconda3/envs/torch1.12/lib/python3.10/site-packages/torch/nn/parallel/distributed.py”, line 976, in forward
self.logger.set_runtime_stats_and_log()
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
"

Yes, CUDA Graphs are compatible with DDP, as explained in the "Usage with DistributedDataParallel" section of the PyTorch CUDA Graphs docs. As the docs explain, you need to run at least 11 warmup iterations in DDP-enabled eager mode before the capture is executed; your snippet only runs 3. The docs also list two other prerequisites for whole-network capture with NCCL >= 2.9.6: disable DDP's internal async error handling before creating the process group, and construct the DDP wrapper itself inside the side-stream context. A sketch follows below.
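
For reference, here is a minimal sketch of that whole-network capture pattern from the docs, adapted to your train_step call. It assumes CUDA >= 11.3 and NCCL >= 2.9.6; net, local_rank, and the train_step arguments are placeholders taken from your snippet:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The docs require disabling DDP's internal async error handling
# before the process group is created.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
dist.init_process_group(backend="nccl")

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    # DDP must be constructed in the side-stream context, and the
    # warmup must run at least 11 DDP-enabled eager iterations.
    model = DDP(net.cuda(local_rank), device_ids=[local_rank])
    for _ in range(11):
        train_step(model, inputs, targets, optimizer, scaler, update=True)
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step, as in your snippet.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    train_step(model, inputs, targets, optimizer, scaler, update=True)

Also keep in mind the usual graph-capture constraint: inputs and targets must be static tensors that you copy new data into in place before each g.replay().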

Thanks for the quick reply. This information is very helpful. I'm still working on it, but I haven't been successful yet.