MPI CUDA stream

When we do

with torch.cuda.stream(stream):
    torch.distributed.isend(...)

Will it affect the stream that (CUDA-aware) MPI uses for communication, or is that an internal MPI implementation detail?

Given the implementation below, it does not seem that ProcessGroupMPI uses any dedicated CUDA streams, so I would assume it is fully delegated to MPI’s implementation?


You made me open the black box :)

I verified that it is an internal MPI implementation detail; I found it in their code.
For example, Open MPI uses its own streams.

This is critical, because unless the streams MPI creates can be accessed somehow (so far I have not found a way to do it in the CUDA manual, but I will look deeper), the only way to change this behavior is to edit the MPI C code and recompile.

Why should normal PyTorch users care?
Because for normal CUDA-aware usage this is very risky: MPI’s internal streams do not wait for our streams, so CUDA-aware MPI is prone to failure unless we fully synchronize our streams before each MPI call.
This results in a slower (or incorrect) program.
(I personally spent a lot of time debugging this…)
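
For reference, a minimal sketch of that full-synchronization workaround (assuming a CUDA-aware MPI backend initialized with dist.init_process_group("mpi") and a CUDA tensor; safe_isend is just an illustrative helper name):

import torch
import torch.distributed as dist

def safe_isend(tensor, dst):
    # Block the host until every kernel that produces `tensor` on the
    # current stream has finished, so MPI's internal streams never see
    # a half-written buffer.
    torch.cuda.current_stream().synchronize()
    return dist.isend(tensor, dst=dst)

The synchronize() call blocks the host and serializes compute and communication, which is exactly the overhead described above.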

I’d like to know what you think; maybe we should open an issue.

By the way, I wonder why irecv uses MPI_ANY_SOURCE in the file you mention:
is that intentional?

I am not aware of the history here. @pietern and @teng-li would know more.

Why should normal PyTorch users care?
Because for normal CUDA-aware usage this is very risky: MPI’s internal streams do not wait for our streams, so CUDA-aware MPI is prone to failure unless we fully synchronize our streams before each MPI call.
This results in a slower (or incorrect) program.

I agree, full synchronization is not acceptable here. Can MPI take a CUDA stream as an argument and then work on that stream like NCCL does? If that is possible, we can let ProcessGroupMPI manage the streams and use CUDA events to synchronize.
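
For illustration, here is the event-based pattern sketched in Python (roughly what ProcessGroupNCCL does internally in C++; it only helps if the backend actually issues its communication on comm_stream, which, as discussed, today’s MPI backend does not):

comm_stream = torch.cuda.Stream()           # dedicated communication stream
ready = torch.cuda.Event()

# ... compute kernels producing `tensor` run on the current stream ...
ready.record(torch.cuda.current_stream())   # mark the point where `tensor` is ready

comm_stream.wait_event(ready)               # GPU-side wait, no host blocking
with torch.cuda.stream(comm_stream):
    work = dist.isend(tensor, dst=1)

The wait happens entirely on the device, so compute and communication can overlap without a host-side synchronize().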

Created an issue in the Open MPI repo.

@mrshenli As far as I know, MPI doesn’t support what you suggest, so it is probably better to ask them directly?


@mrshenli
I think what https://github.com/open-mpi/ompi/issues/7733#issuecomment-629806195 suggests is what should be implemented inside PyTorch in C++ if we want to use the MPI process group correctly.

Maybe add an optional event argument to torch.dist calls (cuda_event_to_sync_with or something).

As far as I know, callbacks on CUDA events are not exposed in the Python API.
(Too bad they aren’t, actually.)
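
To make the suggestion concrete, a purely hypothetical usage (the cuda_event_to_sync_with argument does not exist in torch.distributed; this is only what the proposed API might look like):

ready = torch.cuda.Event()
ready.record(torch.cuda.current_stream())   # `tensor` is ready once this event fires

# hypothetical keyword argument, not part of the current API:
work = dist.isend(tensor, dst=1, cuda_event_to_sync_with=ready)

ProcessGroupMPI could then make its communication wait on the event (or use a host callback) before starting the transfer, instead of requiring the caller to synchronize the whole stream.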

I see. Thanks for sharing!

Maybe add an optional event argument to torch.dist calls (cuda_event_to_sync_with or something).

I am not sure whether we should add this to the c10d Python API if it is only required by the MPI backend. Could you please open an issue on GitHub to kick off the discussion on this pitch? Let’s discuss the options there.

As far as I know, callbacks on CUDA events are not exposed in the Python API.
(Too bad they aren’t, actually.)

We are actually exploring CUDA event callbacks for RPC, and also considering using them to handle CUDA errors. Let me create an issue to track this.