opened 04:46AM - 23 Jun 21 UTC
## ❓ Questions and Help
### Does overlap occur between communication and computation?
Take the NCCL backend as an example: if I launch a collective operation and then a dependent computation:
```python
dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
tensor = tensor * 2
```
Is an explicit CUDA synchronization necessary here?
### Related documentation
> Synchronous operation - the default mode, when async_op is set to False. When the function returns, it is guaranteed that the collective operation is performed. In the case of CUDA operations, it is not guaranteed that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any further function calls utilizing the output of the collective call will behave as expected. For CUDA collectives, function calls utilizing the output on the same CUDA stream will behave as expected. Users must take care of synchronization under the scenario of running under different streams. For details on CUDA semantics such as stream synchronization, see CUDA Semantics. See the below script to see examples of differences in these semantics for CPU and CUDA operations.
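The CPU-collective half of the quoted semantics can be checked with a tiny single-process group (a minimal sketch, assuming the `gloo` backend and that port 29500 is free; with NCCL the same call would additionally involve a side stream):

```python
import torch
import torch.distributed as dist

# Single-process "gloo" group, just to exercise the synchronous-mode semantics
# the docs describe for CPU collectives (the port choice is an assumption).
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # world_size == 1, so t is unchanged
t = t * 2  # safe: the CPU collective has completed by the time the call returns
print(t.tolist())  # -> [2.0, 2.0, 2.0, 2.0]
dist.destroy_process_group()
```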
Based on the above description, I guess explicit synchronization is unnecessary.
However, my earlier investigation shows that NCCL operations are launched on a separate stream, `ncclStream`:
[code](https://github.com/pytorch/pytorch/blob/ed1da5be210c31cc07b033ac0f19f3dd6366feac/torch/lib/c10d/ProcessGroupNCCL.cpp#L1073)
So if I don't specify any stream, will the communication and computation be launched on the same stream?
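For context, if explicit ordering does turn out to be required, I would expect something like the following to be the safe pattern (a sketch using the documented `async_op=True` path; `allreduce_then_scale` is a hypothetical helper name, and the single-process `gloo` group is only for demonstration):

```python
import torch
import torch.distributed as dist

def allreduce_then_scale(tensor, group=None):
    # Launch the collective asynchronously, then wait on the returned work
    # handle before the dependent computation. For NCCL, wait() makes the
    # current CUDA stream wait on the collective; for gloo it blocks the CPU.
    work = dist.all_reduce(tensor, op=dist.ReduceOp.SUM,
                           group=group, async_op=True)
    work.wait()
    return tensor * 2

# Demonstration with a single-process gloo group (the port is an assumption).
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)
result = allreduce_then_scale(torch.ones(3))
print(result.tolist())  # -> [2.0, 2.0, 2.0]
dist.destroy_process_group()
```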