Why does torch.distributed.all_reduce with the nccl backend issue so many D2H and H2D memcpys and run slowly?

I run torch.distributed.all_reduce in my code with the nccl backend.
I find that it takes a very long time to finish (about 2 s). When I profile the code, I find that during the all_reduce there are a lot of memory copy operations between the device and the host.
The profiling result is shown below:

I also find that when I call all_reduce twice, the second call is much faster.
Can anyone help me with this problem?
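For reference, a simplified version of what I am timing (the tensor size and process-group setup here are placeholders, not my real code):

```python
import os
import time
import torch
import torch.distributed as dist

# Placeholder setup; launched with torchrun, which sets LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.randn(1024 * 1024, device="cuda")

for i in range(2):
    torch.cuda.synchronize()
    start = time.time()
    dist.all_reduce(x)  # the first call is ~2 s, the second is much faster
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()} all_reduce #{i}: {time.time() - start:.3f}s")
```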

These are the flow events during the all_reduce (they may contain some other events from before and after the all_reduce):

Hi @puddingfjz. In PyTorch, nccl is initialized lazily, so on the first collective (all_reduce in your case) the nccl communicators are created (see Creating a Communicator — NCCL 2.20.3 documentation), which could explain those operations.
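If you want that one-time setup out of your measured region, one option is to issue a throwaway collective before timing. A minimal sketch (warmup_nccl is just an illustrative helper name, not a PyTorch API):

```python
import torch
import torch.distributed as dist

def warmup_nccl(device):
    # A dummy collective forces lazy NCCL communicator creation up front,
    # so later all_reduce calls reflect steady-state cost.
    t = torch.zeros(1, device=device)
    dist.all_reduce(t)
    torch.cuda.synchronize(device)
```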

When profiling, it's common practice to capture the trace a few steps into training so that setup and cache warmup do not skew the results.
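For example, with torch.profiler you can use a schedule so that the first iterations (where NCCL init and allocator/cache warmup happen) are skipped and only steady-state steps are recorded (train_step below is a placeholder for one of your training iterations):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Wait 2 steps, warm up for 2, then record 2 steady-state steps.
sched = schedule(wait=2, warmup=2, active=2)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=sched) as prof:
    for step in range(10):
        train_step()  # placeholder for one training iteration
        prof.step()   # advance the profiler schedule

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```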

Thanks for your answer! I will check it.