opened 04:46AM - 23 Jun 21 UTC
## ❓ Questions and Help
### Does overlap occur between communication and computation?
Take the NCCL backend as an example: if I launch a collective operation and then a dependent computation:
```python
dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
tensor = tensor * 2
```
Is an explicit CUDA synchronization necessary here?
### Related documentation
> Synchronous operation - the default mode, when async_op is set to False. When the function returns, it is guaranteed that the collective operation is performed. In the case of CUDA operations, it is not guaranteed that the CUDA operation is completed, since CUDA operations are asynchronous. For CPU collectives, any further function calls utilizing the output of the collective call will behave as expected. For CUDA collectives, function calls utilizing the output on the same CUDA stream will behave as expected. Users must take care of synchronization under the scenario of running under different streams. For details on CUDA semantics such as stream synchronization, see CUDA Semantics. See the below script to see examples of differences in these semantics for CPU and CUDA operations.
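The CPU-collective half of the quoted semantics can be checked with a tiny single-process group (a minimal sketch, assuming the `gloo` backend and that port 29500 is free; with NCCL the same call would additionally involve a side stream):

```python
import torch
import torch.distributed as dist

# Single-process "gloo" group, just to exercise the synchronous-mode semantics
# the docs describe for CPU collectives (the port choice is an assumption).
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # world_size == 1, so t is unchanged
t = t * 2  # safe: the CPU collective has completed by the time the call returns
print(t.tolist())  # -> [2.0, 2.0, 2.0, 2.0]
dist.destroy_process_group()
```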
Based on the above description, I guess explicit synchronization is unnecessary.
However, my earlier investigation shows that NCCL operations are launched on a separate stream, `ncclStream`:
[code](https://github.com/pytorch/pytorch/blob/ed1da5be210c31cc07b033ac0f19f3dd6366feac/torch/lib/c10d/ProcessGroupNCCL.cpp#L1073)
So if I don't specify any stream, will the communication and computation be launched on the same stream?
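For context, if explicit ordering does turn out to be required, I would expect something like the following to be the safe pattern (a sketch using the documented `async_op=True` path; `allreduce_then_scale` is a hypothetical helper name, and the single-process `gloo` group is only for demonstration):

```python
import torch
import torch.distributed as dist

def allreduce_then_scale(tensor, group=None):
    # Launch the collective asynchronously, then wait on the returned work
    # handle before the dependent computation. For NCCL, wait() makes the
    # current CUDA stream wait on the collective; for gloo it blocks the CPU.
    work = dist.all_reduce(tensor, op=dist.ReduceOp.SUM,
                           group=group, async_op=True)
    work.wait()
    return tensor * 2

# Demonstration with a single-process gloo group (the port is an assumption).
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)
result = allreduce_then_scale(torch.ones(3))
print(result.tolist())  # -> [2.0, 2.0, 2.0]
dist.destroy_process_group()
```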