Libtorch MPI distribution?

A few days ago I set out to study libtorch distributed training with the MPI backend, and found the example dist-mnist (examples/cpp/distributed/dist-mnist.cpp at main · pytorch/examples · GitHub), in which ProcessGroupMPI is used for distribution.

When I compile the source code, the linker reports two errors:

  • undefined reference to `c10d::ProcessGroupMPI::abort()'
  • undefined reference to `c10d::ProcessGroupMPI::createProcessGroupMPI(std::vector<int, std::allocator<int> >)'

It seems the MPI distribution code is not compiled into the pre-built libtorch package. Is that the case, or is there a library I should link against?
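
For reference, the part of the example that pulls in these symbols looks roughly like this (a minimal sketch based on the dist-mnist example; the exact header path and return types vary between libtorch versions, and `abort()` is referenced indirectly through the class):

```cpp
// Minimal sketch of the ProcessGroupMPI usage that triggers the
// unresolved symbols above (header/signature details are version-dependent).
#include <c10d/ProcessGroupMPI.hpp>
#include <torch/torch.h>

#include <vector>

int main() {
  // Undefined reference #2: the factory that creates the MPI process group.
  auto pg = c10d::ProcessGroupMPI::createProcessGroupMPI();

  // Average a tensor across all ranks, the way dist-mnist averages gradients.
  std::vector<at::Tensor> grads = {torch::ones({2, 2})};
  auto work = pg->allreduce(grads);
  work->wait();
  grads[0] /= static_cast<double>(pg->getSize());

  return 0;
}
```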

Thanks.

The pre-built PyTorch binaries do not include an MPI implementation, so you will have to build from source. There are several MPI implementations you can choose from:

https://pytorch.org/tutorials/intermediate/dist_tuto.html (see the "MPI Backend" section)

When building, you should ensure that the flag USE_MPI=1 is set: pytorch/setup.py at main · pytorch/pytorch · GitHub
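
For example (assuming an MPI implementation such as Open MPI or MPICH is already installed where CMake can find it), a build roughly along the lines of `USE_MPI=1 python setup.py install` from a pytorch source checkout should compile the MPI backend in; the CMake configure output reports whether MPI was detected.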

Thank you. Based on your suggestion, I compiled the PyTorch source myself with the GLOO and MPI backends enabled, and compilation and linking now complete without any error messages. Thanks again.
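
One follow-up note for anyone who finds this later: since the MPI backend is used, the resulting binary has to be started through the MPI launcher, e.g. something like `mpirun -np 2 ./dist-mnist` (the exact launcher name and flags depend on which MPI implementation you installed).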
