torch.cuda.make_graphed_callables replaces the model's forward and backward passes with CUDA graph replays.
So, my question is: during the backward pass, only a CUDA graph is replayed, without going through the PyTorch Python/C++ code, so how is the autograd hook triggered? Or is this a special-cased solution where a single allreduce is added at the end of the whole model's backward?
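For reference, my understanding is that in eager mode DDP launches its allreduces from hooks that the autograd engine fires as each parameter's gradient becomes ready (registered in C++ on the gradient accumulators, with gradients bucketed). The snippet below is not DDP's real code, just a rough Python analogy of that mechanism (assuming an initialized process group), to make clear what I mean by "the autograd hook"; these engine-driven callbacks are exactly what a plain graph replay would not seem to execute.

import torch
import torch.distributed as dist

def attach_naive_allreduce_hooks(model: torch.nn.Module, world_size: int):
    # rough analogy only: one hook per parameter; real DDP buckets gradients
    # and registers its hooks on the C++ gradient accumulators instead
    for p in model.parameters():
        if not p.requires_grad:
            continue

        def _hook(grad, ws=world_size):
            # called by the autograd engine while an eager backward() runs
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            return grad / ws

        p.register_hook(_hook)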
But I think NCCL is not involved in torch.cuda.make_graphed_callables, which only captures the forward and backward passes without the communication collectives, so no NCCL call is captured into the CUDA graph (see the code below).
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# capture the model's forward/backward on a side stream, then wrap it with DDP
s = torch.cuda.Stream(device=args.device)
with torch.cuda.stream(s):
    model = torch.cuda.make_graphed_callables(model, (real_inputs[0],))
ddp_model = DDP(model)
BTW, I know that recent NCCL releases support CUDA graphs, but my goal is to understand how this combination (DDP + torch.cuda.make_graphed_callables) is implemented.
The link does not mention a section, so I guess you are referring to the part of the doc explaining how NCCL < 2.9.6, which does not support full-graph capture, is handled?
If you are using such an old NCCL release for some reason, then you would be correct and you won't be able to capture the communication calls.
With a newer NCCL version, capture the full model instead of parts of it so that the communication calls are captured as well, in case that's what you are asking for; a rough sketch is below.
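This is only a sketch, not a drop-in recipe: it assumes NCCL >= 2.9.6, an already initialized process group, the ddp_model from your snippet, a model that returns a single tensor, and next_batch as a placeholder for your real data.

import torch

static_input = real_inputs[0].clone()

# warm up on a side stream so allocations and DDP buckets exist before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        ddp_model(static_input).sum().backward()
torch.cuda.current_stream().wait_stream(s)

# capture one full iteration: forward, backward, and DDP's NCCL allreduces
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_loss = ddp_model(static_input).sum()
    static_loss.backward()

# each step: copy the next batch into the static input and replay everything
# (a real training loop would also zero or consume the grads between replays)
static_input.copy_(next_batch)  # next_batch is a placeholder tensor
g.replay()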
And my question is: since NCCL is not captured, and the backward stage is replaced with a CUDA graph replay, how can the backward hook be triggered to call the NCCL operations?