Problem compiling PyTorch with WITH_DISTRIBUTED=1

Hello everyone,

I'm writing this post to see if anyone can help me compile PyTorch. It is very strange: when I compile PyTorch without the WITH_DISTRIBUTED=1 parameter, the build seems to go fine. But when I add WITH_DISTRIBUTED=1, python3 setup.py build_deps fails with the error I attach below. I need this flag enabled for parallelization.

I am using Debian 8.6, CMake 3.7.0, and Python 3.5.2, and I'm really very lost…

THE ERROR:

/usr/include/string.h:66:14: note:   ‘memset’
 extern void *memset (void *__s, int __c, size_t __n) __THROW __nonnull ((1));
              ^
CMakeFiles/THD.dir/build.make:398: recipe for target 'CMakeFiles/THD.dir/master_worker/master/THDStorage.cpp.o' failed
make[2]: *** [CMakeFiles/THD.dir/master_worker/master/THDStorage.cpp.o] Error 1
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::send(const thd::Scalar&, int)’:
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:329:52: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
            MPI_UINT8_T, dst_rank, 0, MPI_COMM_WORLD);
                                                    ^
In file included from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.hpp:5:0,
                 from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:1:
/usr/lib/openmpi/include/mpi.h:1384:20: note: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’
 OMPI_DECLSPEC  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
                    ^
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::send(thpp::Tensor&, int)’:
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:340:52: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
            MPI_UINT8_T, dst_rank, 0, MPI_COMM_WORLD);
                                                    ^
In file included from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.hpp:5:0,
                 from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:1:
/usr/lib/openmpi/include/mpi.h:1384:20: note: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’
 OMPI_DECLSPEC  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
                    ^
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual THDGroup thd::DataChannelMPI::newGroup(const std::vector<int>&)’:
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:476:56: error: invalid conversion from ‘const int*’ to ‘int*’ [-fpermissive]
   MPI_Group_incl(world_group, ranks.size(), ranks.data(), &ranks_group);
                                                        ^
In file included from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.hpp:5:0,
                 from /soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:1:
/usr/lib/openmpi/include/mpi.h:1269:20: note: initializing argument 3 of ‘int MPI_Group_incl(MPI_Group, int, int*, ompi_group_t**)’
 OMPI_DECLSPEC  int MPI_Group_incl(MPI_Group group, int n, int *ranks,
                    ^
/soft/pytorch-dist/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:479:66: error: ‘MPI_Comm_create_group’ was not declared in this scope
   MPI_Comm_create_group(MPI_COMM_WORLD, ranks_group, 0, &new_comm);
                                                                  ^
CMakeFiles/THD.dir/build.make:422: recipe for target 'CMakeFiles/THD.dir/master_worker/master/THDTensor.cpp.o' failed
make[2]: *** [CMakeFiles/THD.dir/master_worker/master/THDTensor.cpp.o] Error 1
CMakeFiles/THD.dir/build.make:158: recipe for target 'CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o' failed
make[2]: *** [CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/THD.dir/all' failed
make[1]: *** [CMakeFiles/THD.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I followed these instructions:

Using Python 3 (Python 3.4)

Install build dependencies

Essentials

sudo apt-get update
sudo apt-get install git build-essential
ccache

sudo apt-get install ccache
export CC="ccache gcc"
export CXX="ccache g++"

CMake

The default CMake version in Debian’s repositories is too old.

Ubuntu 16.10 has version 3.5.2 and it works fine.

wget https://cmake.org/files/v3.7/cmake-3.7.0.tar.gz
tar xf cmake-3.7.0.tar.gz
rm cmake-3.7.0.tar.gz
cd cmake-3.7.0
./bootstrap
make
sudo make install
cd ..

Install THD dependencies

Asio C++ Library

sudo apt-get install libasio-dev

MPI implementation

sudo apt-get install mpich

Set up Python

sudo apt-get install python3-dev python3-pip

Set up virtual environment

sudo pip3 install virtualenv
virtualenv venv
source venv/bin/activate

Install PyTorch

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/pytorch-dist/torch/lib"

git clone https://github.com/apaszke/pytorch-dist/
cd pytorch-dist
pip3 install -r requirements.txt
WITH_DISTRIBUTED=1 python3 setup.py build_deps
WITH_DISTRIBUTED=1 python3 setup.py develop

Thanks a lot for your help.

Dani

Thanks, we’ll look into it. However, note that the distributed package is still in pre-alpha and will likely be slow or break in weird ways. We’ll notify everyone once it’s ready for use.

Thanks for your answer, Adam.

What exactly are the requirements for PyTorch to work?

I mean versions of the operating system, gcc, cmake, CUDA, etc.

We are a little lost, but we think the compilation problem comes from CUDA…

Thanks a lot!

CMake and an Anaconda-based Python install will get PyTorch going with all required dependencies.
Optional dependencies are CUDA and cuDNN.
More can be read here: https://github.com/pytorch/pytorch#from-source or you can look at the Docker image: https://github.com/pytorch/pytorch#docker-image

WITH_DISTRIBUTED is not fully fleshed out or supported, so we can't spec out the requirements yet.

The problem is caused by MPI. Installing it might help.
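For what it's worth, the "'MPI_Comm_create_group' was not declared" error points at the MPI version rather than the mere presence of MPI: that function was added in MPI-3.0, while the headers in the log (/usr/lib/openmpi/include/mpi.h) are from Debian 8's default OpenMPI 1.6, an MPI-2.1 implementation. That also suggests the build is picking up the system OpenMPI instead of the mpich the instructions install. A rough sketch to check what the headers advertise (the path and version cutoffs are assumptions; adjust for your system):

```shell
# Check whether the MPI headers the build picks up support MPI-3.
# MPI_Comm_create_group needs MPI >= 3.0 (roughly OpenMPI >= 1.8, MPICH >= 3.x).

mpi3_ok() {  # arg: MPI_VERSION as defined in mpi.h
  [ "$1" -ge 3 ]
}

hdr=/usr/lib/openmpi/include/mpi.h   # path taken from the error log above
if [ -f "$hdr" ]; then
  ver=$(grep -m1 '#define MPI_VERSION' "$hdr" | awk '{print $3}')
  if mpi3_ok "$ver"; then
    echo "MPI $ver.x headers: MPI_Comm_create_group should be available"
  else
    echo "MPI $ver.x headers: too old; install OpenMPI >= 1.8 or MPICH >= 3.x"
  fi
else
  echo "no mpi.h at $hdr"
fi
```

If the headers report MPI 2.x, installing a newer MPI implementation (and making sure CMake finds it first) would be the thing to try.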

Thank you guys for all your work. If I may, I would also like to remind you of users like me who, in most cases, don’t have root privileges on the machines we work on (e.g. clusters). So whenever the PyTorch distributed package rolls out, I would like to request, if possible, that it also be made available for easy installation via Anaconda or pip, which could take care of the dependencies as well.

Thanks again, and keep up the good work!

Cheers.