Segfault using cuda with openmpi

Ok, solved it.
Since I was a bit lost due to the lack of documentation on using PyTorch + MPI + GPU, I will give here the steps I followed.
The main thing I was missing is that I needed an openMPI build that is "cuda-aware".

Main steps to follow:

  1. Install a “cuda-aware” openMPI: needs to be compiled from source
  2. Install PyTorch from source

Step by step:

1. Install openMPI --with-cuda

If you have openMPI and you want to check if it is “cuda-aware”, run:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

If you get true, perfect, nothing to do.
If you get false, too bad, you need to recompile it.
(source : Link)

At this step, if you have openMPI installed (or any other MPI implementation), I strongly advise uninstalling it so as not to mess things up with the paths…

Then, download the latest openMPI version here and extract it:

wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
gunzip -c openmpi-3.0.0.tar.gz | tar xf -

We then follow the steps from here, but we add the --with-cuda parameter to the ./configure command as explained here:

cd openmpi-3.0.0
./configure --prefix=/home/$USER/.openmpi --with-cuda
make all install

The --prefix parameter is the install path and is mandatory. Note that you need to choose a directory where you have write permission (I didn’t have it in /usr/local, the one suggested in the link).

Now you can prepare yourself a cup of coffee (as proposed here) since this takes around 15 min.

Once that is done, you need to add the openMPI bin directory to your PATH, and the lib directory to the library path:

 export PATH="$PATH:/home/$USER/.openmpi/bin"
 export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"

I recommend adding these to your .bashrc/.zshrc/… straight away.
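If you want to do that from the terminal, here is a small sketch (it assumes bash and the --prefix used above; adapt the rc file for zsh). The duplicate check is just my own precaution, not from the original instructions:

```shell
# Sketch: append the openMPI paths to ~/.bashrc, skipping if already present.
# Assumes the install prefix /home/$USER/.openmpi chosen earlier.
RC="$HOME/.bashrc"
if ! grep -q '.openmpi/bin' "$RC" 2>/dev/null; then
  cat >> "$RC" <<'EOF'
export PATH="$PATH:/home/$USER/.openmpi/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
EOF
fi
```

Open a new shell (or `source ~/.bashrc`) afterwards so the change takes effect.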

Check that it’s working; this is more or less what you should get:

mpirun --version
> mpirun (Open MPI) 3.0.0

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
> mca:mpi:base:param:mpi_built_with_cuda_support:value:true

2. Install pytorch (from source)

Now, you just need to install PyTorch from source (remove any existing installation properly and entirely first).
Be sure to first run:
conda update conda
Then, copying for convenience from github:

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies
conda install numpy pyyaml mkl setuptools cmake cffi

# Add LAPACK support for the GPU
conda install -c pytorch magma-cuda80 # or magma-cuda75 if CUDA 7.5

Note that the PyTorch README should probably be updated, since you need to use the pytorch channel in Anaconda to get the latest version of magma (that was the case for me).

And finally:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch && python setup.py install

During compilation you should see something like the following, which most probably means that pytorch found our installation of openMPI:

Found openMP
Compiling with openMP support

I guess PyTorch automatically detects our installation since it is in our PATH, so be sure to have it there before compiling. Also make sure you don’t have two MPI installations, and that the one you just installed is the one you run with mpirun or mpiexec.

Hope this may help others… (if it was obvious for some, it was not really me :stuck_out_tongue:)


Note about using Cuda + openMPI with pytorch:
I had to manually set the device with torch.cuda.set_device(x). If I just use .cuda(x), it crashes with the same cryptic error message.
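To make that note concrete, here is a small sketch of picking a GPU per MPI rank before creating any tensors. The `local_device` helper and its round-robin mapping are just my own illustration (not from the setup above); the torch calls are shown as comments since they need the CUDA-aware build to actually run:

```python
def local_device(rank, num_gpus):
    """Round-robin mapping from an MPI rank to a GPU index on its node.

    Illustrative helper: with 4 GPUs, ranks 0..3 get devices 0..3,
    rank 4 wraps back to device 0, and so on.
    """
    if num_gpus <= 0:
        raise ValueError("need at least one GPU")
    return rank % num_gpus

# In a real script (sketch only, requires the build described above):
#   import torch
#   import torch.distributed as dist
#   dist.init_process_group(backend="mpi")
#   torch.cuda.set_device(local_device(dist.get_rank(),
#                                      torch.cuda.device_count()))
```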
