How does DDP work with torch.cuda.make_graphed_callables

The Getting Started with Distributed Data Parallel tutorial (PyTorch Tutorials 2.3.0+cu121 documentation) mentions:
DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the corresponding gradient is computed in the backward pass.
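For context, my mental model of how these per-parameter hooks fire in eager mode is roughly the sketch below (simplified, not DDP's actual implementation; the print statement just stands in for the bucketed allreduce DDP would launch):

import torch

# Toy module; in eager mode autograd executes the backward graph node by
# node, and a hook registered on a parameter fires as soon as that
# parameter's gradient has been produced.
model = torch.nn.Linear(8, 4)

def make_hook(name):
    def hook(grad):
        # DDP's real hook would hand the gradient to a bucketed allreduce;
        # printing here just shows when the hook fires.
        print(f"grad ready for {name}")
        return grad
    return hook

for name, p in model.named_parameters():
    p.register_hook(make_hook(name))

model(torch.randn(2, 8)).sum().backward()  # hooks fire during this eager backward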

torch.cuda.make_graphed_callables replaces the model's forward and backward passes with a CUDA graph replay.

So my question is: during backward, only a CUDA graph is replayed, without going through the PyTorch Python/C++ autograd code, so how is the autograd hook triggered? Or is this a special solution where a single allreduce is added at the end of the whole model backward just for this case?
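To make the second half of the question concrete, the "special solution" I am imagining would look roughly like this (a purely hypothetical sketch, not claiming this is what DDP does; it assumes an already initialized process group):

import torch
import torch.distributed as dist

def backward_then_allreduce(model, loss):
    # Hypothetical scheme: the graphed backward (a graph replay) only fills
    # the p.grad buffers, and gradient synchronization happens once,
    # explicitly, after the whole backward instead of via per-parameter hooks.
    loss.backward()                  # inside the graphed section: graph replay
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)  # sums gradients across ranks
            p.grad.div_(world_size)  # average them, as DDP would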

thanks

The CUDA Graph will capture the corresponding NCCL operations performed in the backward pass and will replay these afterwards.

thanks for the answer.

But I think NCCL is not involved in torch.cuda.make_graphed_callables, which only captures the forward and backward passes without the communication collectives, so no NCCL call is captured into the CUDA graph (see the code below).

According to CUDA semantics — PyTorch 2.3 documentation:
Call make_graphed_callables() on graphable network sections before wrapping the network with DDP.

The code looks like this:

# Capture only the model's own forward/backward on a side stream
# (partial-network capture: no collectives are recorded), then wrap with DDP.
s = torch.cuda.Stream(device=args.device)
with torch.cuda.stream(s):
    model = torch.cuda.make_graphed_callables(model, (real_inputs[0],))
    ddp_model = DDP(model)

BTW, I know that recent NCCL versions support CUDA graph capture, but my purpose is to understand how this method (DDP + torch.cuda.make_graphed_callables) is implemented.

The link does not point to a specific section, so I guess you are referring to the doc explaining how NCCL < 2.9.6, which does not support full-graph capture, is handled?
If you are using such an old NCCL release for some reason, then you would be correct, and you won't be able to capture the communication calls.

With a newer NCCL version, capture the full model instead of only parts of it so that the communication calls are also captured, in case that's what you are asking for.
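For reference, my understanding of whole-network capture with DDP (per the CUDA semantics docs, NCCL >= 2.9.6) is roughly the sketch below; the model, shapes, and loop counts are illustrative, and it is meant to be launched with torchrun:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# 1. Disable DDP's internal async error handling before init_process_group.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 64).cuda()
loss_fn = torch.nn.MSELoss()
static_input = torch.randn(32, 64, device="cuda")
static_target = torch.randn(32, 64, device="cuda")

s = torch.cuda.Stream()

# 2. Construct DDP in a side-stream context before full-backward capture.
with torch.cuda.stream(s):
    ddp_model = DDP(model, device_ids=[local_rank])

# 3. Run at least 11 DDP-enabled eager warmup iterations before capture.
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(11):
        out = ddp_model(static_input)
        loss_fn(out, static_target).backward()
        ddp_model.zero_grad(set_to_none=False)
torch.cuda.current_stream().wait_stream(s)

# Capture one full forward+backward: the allreduces that DDP's autograd
# hooks launch during capture are recorded into the graph itself.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = ddp_model(static_input)
    static_loss = loss_fn(static_out, static_target)
    static_loss.backward()

# Each replay relaunches the captured kernels, including the NCCL
# allreduces, so gradients are synchronized without any hooks firing.
static_input.copy_(torch.randn(32, 64, device="cuda"))
g.replay()

This is why, with a full capture, the autograd hooks do not need to fire at replay time: the communication kernels are already baked into the graph.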

Thanks, we are referring to the same link.

And my question is: since NCCL is not captured and the backward stage is replaced with a CUDA graph replay, how can the backward hook be triggered to launch the NCCL operations?