When will the callback be triggered in PyTorch DDP bucket allreduce?

Hello,
Can someone help me understand when the callback function is called in the DDP reducer's bucket allreduce? Previously I thought it was called when the work on the GPU's CUDA stream is done, but now I think it is called when the work on the CPU is done.
The callback function is defined in the collective method in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:

    if (work->recordFunctionEndCallback_) {
      work->future_->addCallback([work](at::ivalue::Future& /* unused */) {
        work->recordFunctionEndCallback_();
      });
    }
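
To make the question concrete, here is a minimal sketch of the generic "future with callbacks" pattern that the snippet above relies on. SimpleFuture and demo are hypothetical names made up for illustration; this is not at::ivalue::Future and it does not model CUDA streams or events. It only assumes the usual semantics for such futures: a callback added before completion runs when the future is marked completed, and a callback added after completion runs immediately on the calling thread. Under that assumption, the timing of the callback depends on when the code marks the future completed, which is the part the question is about.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a future with callbacks (NOT at::ivalue::Future).
class SimpleFuture {
 public:
  // Queue cb if the future is still pending; otherwise run it right away
  // on the calling thread.
  void addCallback(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!completed_) {
        callbacks_.push_back(std::move(cb));
        return;
      }
    }
    cb();
  }

  // Mark the future completed and run every queued callback on the
  // thread that calls markCompleted().
  void markCompleted() {
    std::vector<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      completed_ = true;
      pending.swap(callbacks_);
    }
    for (auto& cb : pending) {
      cb();
    }
  }

 private:
  std::mutex mutex_;
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Records the order of events for the two cases described above.
std::vector<std::string> demo() {
  std::vector<std::string> log;

  // Case 1: callback registered before completion.
  SimpleFuture pending;
  pending.addCallback([&log] { log.push_back("early callback"); });
  log.push_back("before markCompleted");
  pending.markCompleted();

  // Case 2: callback registered after completion.
  SimpleFuture done;
  done.markCompleted();
  done.addCallback([&log] { log.push_back("late callback"); });
  log.push_back("after addCallback");

  return log;
}
```

Calling demo() shows that neither callback waits for anything other than the future's completed flag. Whether that flag is set when the collective is enqueued on the CPU or when the GPU work finishes is what determines the answer for ProcessGroupNCCL.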