Behavior of wait() on async CUDA collectives

I am confused about the behavior of wait() on async CUDA collectives.

As described in the torch.distributed documentation:

  • wait() - […] In the case of CUDA collectives, will block until the operation has been successfully enqueued onto a CUDA stream and the output can be utilized on the default stream without further synchronization.

From that paragraph, I understand that wait() doesn't wait for the completion of the collective, but only waits for the operation to be enqueued.
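
If that is right, then something like this should be safe (a minimal sketch, assuming the process group is already initialized with the NCCL backend and t is a CUDA tensor on this rank; all_reduce is just for brevity):

import torch
import torch.distributed as dist

work = dist.all_reduce(t, async_op=True)  # kernel enqueued on the NCCL stream
work.wait()   # returns once the default stream is ordered after the op;
              # the CPU is not blocked until the collective completes
out = t * 2   # safe: runs on the default stream, after the all_reduce
torch.cuda.synchronize()  # only here does the CPU wait for the GPU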

However, from experiments I have run, it seems that it does block other collectives from starting. For example, in this code:

import torch.distributed as dist

# dst and src are preallocated CUDA tensors; pg1 and pg2 are two distinct process groups
r1 = dist.all_to_all_single(dst, src, group=pg1, async_op=True)
r1.wait()
r2 = dist.all_to_all_single(dst, src, group=pg2, async_op=True)
r2.wait()

I would expect wait() to only enqueue the collective onto the stream, then for the second collective to be enqueued too and run in parallel with the first (since they are on different PGs). However, the second collective waits for the first one to finish executing.
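
For reference, this is roughly how I observed it (a simplified sketch using torch.profiler, with the same dst, src, pg1 and pg2 as above):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    r1 = dist.all_to_all_single(dst, src, group=pg1, async_op=True)
    r1.wait()
    r2 = dist.all_to_all_single(dst, src, group=pg2, async_op=True)
    r2.wait()
    torch.cuda.synchronize()

prof.export_chrome_trace("trace.json")
# In the trace, the second NCCL kernel starts only after the first one ends.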

So what is the real behavior of wait() at the GPU level? Will all collectives posted after it wait for all collectives posted before it, even if they are on a different PG? By "posted" I mean calling the dist.collective() function, e.g. dist.all_to_all_single.

All process groups enqueue their collectives onto the same internal stream by default, which is why the second collective always runs after the first. If they were on different streams, you would see different behavior.
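
To make that concrete: even enqueueing both collectives before waiting on either does not give overlap here, because the kernels still land on the same internal stream (a sketch, assuming separate buffers dst1/src1 and dst2/src2 so the two collectives could in principle run concurrently):

# Enqueue both collectives before waiting on either.
r1 = dist.all_to_all_single(dst1, src1, group=pg1, async_op=True)
r2 = dist.all_to_all_single(dst2, src2, group=pg2, async_op=True)

# Each wait() only orders the default stream after the corresponding
# collective; the kernels themselves remain serialized on the shared stream.
r1.wait()
r2.wait()

You would only see the two kernels overlap if pg1 and pg2 dispatched onto different internal streams.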