```
MakeFiles/c10.dir/util/numa.cpp.o -c …/c10/util/numa.cpp
…/c10/util/numa.cpp:6:10: fatal error: numa.h: No such file or directory
 #include <numa.h>
          ^~~~~~~~
compilation terminated.
[1684/4092] Building CXX object third_p…akeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 740, in <module>
    build_deps()
  File "setup.py", line 323, in build_deps
    cmake=cmake)
  File "/cluster/home/cnphuong/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 340, in build
    self.run(build_args, my_env)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '64']' returned non-zero exit status 1.
```
Hey @ph0123, DistributedDataParallel with a CPU model should be supported by default in the release binaries. You can enable this mode by passing in a CPU model and not providing a `device_ids` argument. If this is all you need, you don't need to compile from source, I think? Did you hit any error when trying to run DDP with CPU models?
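To illustrate the reply above, here is a minimal sketch of CPU-only DDP using the `gloo` backend from the release binaries. It runs as a single process (`world_size=1`) for simplicity; the function name `run_cpu_ddp`, the toy `nn.Linear` model, and the rendezvous address/port are illustrative choices, not anything from the thread. Note that `DDP(model)` is constructed without `device_ids`, exactly as suggested.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_cpu_ddp(rank: int = 0, world_size: int = 1) -> float:
    # Single-process sketch; a real job would launch one process per rank
    # (e.g. via torch.multiprocessing.spawn or torchrun).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative rendezvous
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 5)   # plain CPU model
    ddp_model = DDP(model)     # no device_ids argument for CPU training

    loss = ddp_model(torch.randn(8, 10)).sum()
    loss.backward()            # gradients are all-reduced across ranks here

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    run_cpu_ddp()
```

With more than one process, each rank would call `run_cpu_ddp(rank, world_size)` and `gloo` would average the gradients during `backward()`.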
Thanks. Last time I installed from source, and I hit some errors when running the program with distributed parallel. I checked the tutorial, and it said I had to install from source to run distributed training on CPUs.
But now that I have read the documents again, it seems installing from source is not needed.
**MPI Backend**
The Message Passing Interface (MPI) is a standardized tool from the field of high-performance computing. It supports point-to-point and collective communication and was the main inspiration for the API of `torch.distributed`. Several implementations of MPI exist (e.g. [Open-MPI](https://www.open-mpi.org/), [MVAPICH2](http://mvapich.cse.ohio-state.edu/), [Intel MPI](https://software.intel.com/en-us/intel-mpi-library)), each optimized for different purposes. The advantage of using the MPI backend lies in MPI's wide availability - and high level of optimization - on large computer clusters. [Some](https://developer.nvidia.com/mvapich) [recent](https://developer.nvidia.com/ibm-spectrum-mpi) [implementations](https://www.open-mpi.org/) are also able to take advantage of CUDA IPC and GPU Direct technologies in order to avoid memory copies through the CPU.
Unfortunately, PyTorch's binaries cannot include an MPI implementation, so we'll have to recompile PyTorch by hand. Fortunately, this process is fairly simple: upon compilation, PyTorch will look *by itself* for an available MPI implementation. The following steps install the MPI backend by installing PyTorch [from source](https://github.com/pytorch/pytorch#from-source).
1. Create and activate your Anaconda environment, install all the pre-requisites following [the guide](https://github.com/pytorch/pytorch#from-source), but do **not** run `python setup.py install` yet.
2. Choose and install your favorite MPI implementation. Note that enabling CUDA-aware MPI might require some additional steps. In our case, we’ll stick to Open-MPI *without* GPU support: `conda install -c conda-forge openmpi`
3. Now, go to your cloned PyTorch repo and execute `python setup.py install` .
In order to test our newly installed backend, a few modifications are required.
1. Replace the content under `if __name__ == '__main__':` with `init_process(0, 0, run, backend='mpi')` .
2. Run `mpirun -n 4 python myscript.py` .
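The two modifications above can be sketched as follows. This assumes the tutorial's `run` function (shown here as a placeholder all-reduce) and its `init_process` helper; with the MPI backend, the `rank` and `size` values passed in are ignored, since `mpirun` provides them.

```python
import torch
import torch.distributed as dist

def run(rank, size):
    # Placeholder for the tutorial's collective-communication logic:
    # every rank contributes a 1, so the sum equals the world size.
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {rank} has {tensor[0]}")

def init_process(rank, size, fn, backend="mpi"):
    # With backend="mpi", rank and world size come from mpirun,
    # so the rank/size arguments passed here are superfluous.
    dist.init_process_group(backend)
    fn(dist.get_rank(), dist.get_world_size())

if __name__ == "__main__":
    init_process(0, 0, run, backend="mpi")
```

Launching with `mpirun -n 4 python myscript.py` spawns four processes, each of which joins the group and learns its own rank and world size from MPI.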
The reason for these changes is that MPI needs to create its own environment before spawning the processes. MPI will also spawn its own processes and perform the handshake described in [Initialization Methods](https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods), making the `rank` and `size` arguments of `init_process_group` superfluous. This is actually quite powerful, as you can pass additional arguments to `mpirun` in order to tailor computational resources for each process (things like the number of cores per process, hand-assigning machines to specific ranks, and [some more](https://www.open-mpi.org/faq/?category=running#mpirun-hostfile)). Doing so, you should obtain the same familiar output as with the other communication backends.