MPI PyTorch backend

I am building PyTorch from source. What are the correct environment variables to set so that PyTorch builds with MPI? My command is as follows:

CC=$(which gcc) CXX=$(which g++) CUDA_HOME=$CUDAPATH NCCL_ROOT_DIR=$CUDAPATH CMAKE_LIBRARY_PATH="$CONDA_PREFIX/lib" CMAKE_INCLUDE_PATH="$CONDA_PREFIX/include/" WITH_DISTRIBUTED=1 python setup.py install | tee log.txt 

From the log, the build process appears to find my MPI libraries:

-- NCCL_MAJOR_VERSION: 2
-- Found NCCL ......
-- Found MPI_C: ....
-- Found MPI_CXX:.....
-- Found MPI: TRUE (found version "3.1")
-- Found Gloo: TRUE

The initial summary report, however, has:

--   USE_MPI               : OFF
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : OFF
--   USE_NERVANA_GPU       : OFF
--   USE_NNPACK            : OFF
--   USE_OBSERVERS         : OFF
--   USE_OPENCL            : OFF
--   USE_OPENCV            : OFF
--   USE_OPENMP            : OFF
--   USE_PROF              : OFF
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : OFF

I have even tried adding USE_MPI=1, but the build summary still says USE_MPI OFF and USE_DISTRIBUTED OFF.

I have mpirun on my PATH and the MPI libraries on my LD_LIBRARY_PATH.
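To double-check the installed build at runtime, I was planning to inspect it along these lines (just a sketch; I'm assuming torch.__config__.show() and torch.distributed.is_mpi_available() exist in the version I'm building):

import torch
import torch.distributed as dist

# Compile-time flags baked into the install (should mention USE_MPI / USE_DISTRIBUTED)
print(torch.__config__.show())

# Runtime view of the same thing
print("distributed available:", dist.is_available())
print("MPI backend available:", dist.is_mpi_available())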

For reference, I am working on a POWER (PowerPC) system using Spectrum MPI and CUDA 9.2; NCCL2 appears to work (though I am not sure how to test this), and I do not have sudo.
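For the NCCL question, this is the rough smoke test I had in mind (assuming torch.cuda.nccl exposes version() and all_reduce() in this build):

import torch
import torch.cuda.nccl as nccl

print("NCCL version:", nccl.version())

if torch.cuda.device_count() >= 2:
    # one tensor per GPU, all ones; all_reduce sums them in place
    tensors = [torch.ones(4, device="cuda:{}".format(i))
               for i in range(torch.cuda.device_count())]
    nccl.all_reduce(tensors)
    print(tensors[0])  # expect every element == number of GPUs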

From the forum I see that MPI support is still experimental, though there are a few scattered examples of people using DistributedDataParallel, which is my goal. On the cluster I am using, the Gloo backend works within a node but cannot communicate between nodes (I am waiting to confirm this with our sysadmins); it appears that MPI is the only communication allowed between nodes.
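For context, this is roughly the setup I am trying to get to (a minimal sketch assuming the MPI backend initializes; the model and the rank-to-GPU mapping are placeholders):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# With the MPI backend, rank and world size come from mpirun rather than
# from arguments passed to init_process_group.
dist.init_process_group(backend="mpi")

torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = torch.nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])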

Am I looking at the wrong part of the build log? The reason I ask is that the MPI communicator dies:

*** An error occurred in MPI_Finalize
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[a36n07:11508] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate     error messages, and not able to guarantee that all other pr!

The code still continues to run and produces output, but I am concerned that this will affect performance and/or the quality of the result. Is MPI communication actually still happening?
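To check that, the smallest test I could think of is something like the following, launched with mpirun (again just a sketch; it assumes init_process_group with the mpi backend succeeds at all):

import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
t = torch.ones(1) * dist.get_rank()
dist.all_reduce(t)  # defaults to a SUM across all ranks
# every rank should print 0 + 1 + ... + (world_size - 1) if communication works
print("rank {}: {}".format(dist.get_rank(), t.item()))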

So any advice would be much appreciated.
