MPI CUDA stream

When we do

with torch.cuda.stream(stream):
    torch.distributed.isend(...)

Will it affect the stream that (CUDA-aware) MPI uses for communication, or is that an internal MPI implementation detail?

Given the implementation below, it does not seem that ProcessGroupMPI uses any dedicated CUDA streams, so I would assume it is fully delegated to MPI’s implementation?


You made me open the black box :)

I verified that it is an internal MPI implementation detail; I found it in their code.
For example, Open MPI uses its own streams.

This is critical, because unless the streams MPI creates can be accessed somehow (so far I have not found a way to do it in the CUDA manual, but I will look deeper), the only way to change this behavior is to edit the MPI C code and recompile.

Why should normal PyTorch users care?
Because for normal CUDA-aware usage this is very risky: MPI’s internal streams do not wait for our streams, so CUDA-aware MPI is prone to failure unless we fully synchronize our streams before each MPI call.
This results in a slower (or incorrect) program.
(I personally spent a lot of time debugging this…)
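
For reference, a minimal sketch of that full-synchronization workaround (assuming a CUDA-aware MPI backend initialized with dist.init_process_group("mpi") and a CUDA tensor; safe_isend is just an illustrative helper name):

import torch
import torch.distributed as dist

def safe_isend(tensor, dst):
    # Block the host until every kernel that produces `tensor` on the
    # current stream has finished, so MPI's internal streams never see
    # a half-written buffer.
    torch.cuda.current_stream().synchronize()
    return dist.isend(tensor, dst=dst)

The synchronize() call blocks the host and serializes compute and communication, which is exactly the overhead described above.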

I’d like to know what you think; maybe we should open an issue.

By the way, I wonder why irecv uses MPI_ANY_SOURCE in the file you mention:
is that intentional?

I am not aware of the history here. @pietern and @teng-li would know more.

Why should normal PyTorch users care?
Because for normal CUDA-aware usage this is very risky: MPI’s internal streams do not wait for our streams, so CUDA-aware MPI is prone to failure unless we fully synchronize our streams before each MPI call.
This results in a slower (or incorrect) program.

I agree, full synchronization is not acceptable here. Can MPI take a CUDA stream as an argument and then work on that stream like NCCL does? If that is possible, we can let ProcessGroupMPI manage the streams and use CUDA events to synchronize.
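
For illustration, here is the event-based pattern sketched in Python (roughly what ProcessGroupNCCL does internally in C++; it only helps if the backend actually issues its communication on comm_stream, which, as discussed, today’s MPI backend does not):

comm_stream = torch.cuda.Stream()           # dedicated communication stream
ready = torch.cuda.Event()

# ... compute kernels producing `tensor` run on the current stream ...
ready.record(torch.cuda.current_stream())   # mark the point where `tensor` is ready

comm_stream.wait_event(ready)               # GPU-side wait, no host blocking
with torch.cuda.stream(comm_stream):
    work = dist.isend(tensor, dst=1)

The wait happens entirely on the device, so compute and communication can overlap without a host-side synchronize().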

Created an issue in the Open MPI repo.

@mrshenli As far as I know, MPI doesn’t support what you suggest, so it is probably better to ask them directly?


@mrshenli
I think what https://github.com/open-mpi/ompi/issues/7733#issuecomment-629806195 suggests is what should be implemented inside PyTorch in C++ if we want to use the MPI process group correctly.

Maybe add an optional event argument to torch.dist calls (cuda_event_to_sync_with or something).

As far as I know, callbacks on CUDA events are not exposed in the Python API.
(Too bad they aren’t, actually.)
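
To make the suggestion concrete, a purely hypothetical usage (the cuda_event_to_sync_with argument does not exist in torch.distributed; this is only what the proposed API might look like):

ready = torch.cuda.Event()
ready.record(torch.cuda.current_stream())   # `tensor` is ready once this event fires

# hypothetical keyword argument, not part of the current API:
work = dist.isend(tensor, dst=1, cuda_event_to_sync_with=ready)

ProcessGroupMPI could then make its communication wait on the event (or use a host callback) before starting the transfer, instead of requiring the caller to synchronize the whole stream.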

I see. Thanks for sharing!

Maybe add an optional event argument to torch.dist calls (cuda_event_to_sync_with or something).

I am not sure whether we should add this to the c10d Python API if it is only required by the MPI backend. Could you please open an issue on GitHub to kick off the discussion on this pitch? Let’s discuss the options there.

As far as I know, callbacks on CUDA events are not exposed in the Python API.
(Too bad they aren’t, actually.)

We are actually exploring CUDA event callbacks for RPC, and also considering using them to handle CUDA errors. Let me create an issue to track this.