Error installing from source on server

I followed the instructions from the PyTorch GitHub repository, as below.

  1. install dependencies
    conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi
    (on a cmake error: conda install -c anaconda cmake)

  2. clone pytorch
    git clone --recursive
    cd pytorch

  3. build and install
    export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/…/"}
    python install

The error is as below.

MakeFiles/c10.dir/util/numa.cpp.o -c …/c10/util/numa.cpp
…/c10/util/numa.cpp:6:10: fatal error: numa.h: No such file or directory
#include <numa.h>
compilation terminated.
[1684/4092] Building CXX object third_p…akeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File “”, line 740, in
File “”, line 323, in build_deps
File “/cluster/home/cnphuong/pytorch/tools/”, line 62, in build_caffe2
File “/cluster/home/cnphuong/pytorch/tools/setup_helpers/”, line 340, in build, my_env)
File “/cluster/home/cnphuong/pytorch/tools/setup_helpers/”, line 141, in run
check_call(command, cwd=self.build_dir, env=env)
File “/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/”, line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '64']' returned non-zero exit status 1.
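
With a build this long (over 4000 steps), the first real compiler error is easy to lose in the output. Below is a minimal sketch for pulling it out, assuming GCC-style `fatal error: <header>: No such file or directory` lines like the one above. The helper name `missing_headers` and the shortened sample log are only for illustration and are not part of PyTorch's build tools.

```python
import re

# Matches GCC/Clang lines such as:
#   path/to/file.cpp:6:10: fatal error: numa.h: No such file or directory
MISSING_HEADER = re.compile(
    r"^(?P<source>.+?):(?P<line>\d+):(?P<col>\d+): fatal error: "
    r"(?P<header>\S+): No such file or directory",
    re.MULTILINE,
)


def missing_headers(log_text):
    """Return (header, source file, line number) for each missing-header error."""
    return [
        (m.group("header"), m.group("source"), int(m.group("line")))
        for m in MISSING_HEADER.finditer(log_text)
    ]


# Shortened copy of the log above (the source path is abbreviated).
log = """\
MakeFiles/c10.dir/util/numa.cpp.o -c c10/util/numa.cpp
c10/util/numa.cpp:6:10: fatal error: numa.h: No such file or directory
#include <numa.h>
compilation terminated.
ninja: build stopped: subcommand failed.
"""

for header, source, line in missing_headers(log):
    print(f"{header} is missing (included from {source}, line {line})")
```

Here it reports `numa.h`. On Linux that header normally comes from the libnuma development package, so the build machine is probably missing it.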

When I ran the install command, I saw some output here.

Could you please create an issue on github to track this?

And this does not seem to be relevant to torch.distributed?

I used this tag because distributed training on CPUs is only supported when building from source. I think most people following this tag have more experience with it than others.


Hey @ph0123, DistributedDataParallel with a CPU model should be supported by default in the release binaries. You can enable this mode by passing in a CPU model and not providing a `device_ids` argument. If this is all you need, I don't think you need to compile from source. Did you hit any error when trying to run DDP with CPU models?
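
A minimal sketch of what that looks like, assuming a single process, the Gloo backend, and the default `env://` initialization. The helper names, the port number, and the tiny `Linear` model are placeholders, not anything from your code.

```python
import os


def single_process_env(port=29500):
    """Environment variables read by the default init_method ("env://")."""
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": "0",
        "WORLD_SIZE": "1",
    }


def ddp_step_on_cpu():
    """Run one forward/backward pass through a DDP-wrapped CPU model."""
    # Imported here so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    os.environ.update(single_process_env())
    dist.init_process_group(backend="gloo")
    try:
        model = torch.nn.Linear(4, 2)   # the model stays on the CPU
        ddp_model = DDP(model)          # note: no device_ids argument
        loss = ddp_model(torch.randn(8, 4)).sum()
        loss.backward()                 # DDP all-reduces gradients across ranks here
        return tuple(model.weight.grad.shape)
    finally:
        dist.destroy_process_group()
```

Calling `ddp_step_on_cpu()` with a release binary should be enough to check whether CPU DDP works on your machine. A real multi-process run would set `RANK` and `WORLD_SIZE` per process instead of hard-coding them.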

Dear mrshenli,

Thanks. Last time I installed from source and got some errors when running the program with Distributed Parallel. I checked the tutorial, and understood that I had to install from source to run distributed training on CPUs.

But now that I have read the documentation again, I see that it does not need to be installed from source.

I will try. Thank you so much!

Please see

**MPI Backend**

The Message Passing Interface (MPI) is a standardized tool from the field of high-performance computing. It allows to do point-to-point and collective communications and was the main inspiration for the API of `torch.distributed`. Several implementations of MPI exist (e.g. Open-MPI, MVAPICH2, Intel MPI), each optimized for different purposes. The advantage of using the MPI backend lies in MPI’s wide availability - and high-level of optimization - on large computer clusters. Some recent implementations are also able to take advantage of CUDA IPC and GPU Direct technologies in order to avoid memory copies through the CPU.

Unfortunately, PyTorch’s binaries can not include an MPI implementation and we’ll have to recompile it by hand. Fortunately, this process is fairly simple given that upon compilation, PyTorch will look *by itself* for an available MPI implementation. The following steps install the MPI backend, by installing PyTorch from source.

1. Create and activate your Anaconda environment, install all the pre-requisites following the guide, but do **not** run `python install` yet.
2. Choose and install your favorite MPI implementation. Note that enabling CUDA-aware MPI might require some additional steps. In our case, we’ll stick to Open-MPI *without* GPU support: `conda install -c conda-forge openmpi`
3. Now, go to your cloned PyTorch repo and execute `python install`.

In order to test our newly installed backend, a few modifications are required.

1. Replace the content under `if __name__ == '__main__':` with `init_process(0, 0, run, backend='mpi')`.
2. Run `mpirun -n 4 python`.

The reason for these changes is that MPI needs to create its own environment before spawning the processes. MPI will also spawn its own processes and perform the handshake described in Initialization Methods, making the `rank` and `size` arguments of `init_process_group` superfluous. This is actually quite powerful as you can pass additional arguments to `mpirun` in order to tailor computational resources for each process (things like number of cores per process, hand-assigning machines to specific ranks, and some more). Doing so, you should obtain the same familiar output as with the other communication backends.
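
To see that the launcher, rather than the script, decides the rank and world size, here is a minimal sketch that needs only the standard library. It assumes Open-MPI, as installed above, whose `mpirun` exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` to every process it starts. Other MPI implementations use different variable names, and the helper name `mpi_rank_and_size` is made up for this example. The MPI backend itself obtains these values through MPI rather than from these variables, so this is only a way to inspect what the launcher assigned.

```python
import os


def mpi_rank_and_size(environ=None):
    """Read the rank and world size that Open-MPI's mpirun exports to each process.

    Falls back to (0, 1) when the script is started without mpirun.
    """
    if environ is None:
        environ = os.environ
    rank = int(environ.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(environ.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, size


if __name__ == "__main__":
    rank, size = mpi_rank_and_size()
    print(f"process {rank} of {size}")
```

Saved under a hypothetical name such as `show_rank.py` and started with `mpirun -n 4 python show_rank.py`, each of the four processes should report a different rank without any rank being passed in by hand.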


Oh I see, you are trying to use MPI. Is MPI the only option, or will Gloo or NCCL also be acceptable?

And yes, the MPI backend needs to be built from source.

BTW, the build log shown here does not seem to be complete. Could you please also paste the last few screens of logs?

I tried to work with the MPI backend, but it did not work.
I changed to the Gloo backend.
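
For anyone who wants the same script to run on both kinds of install, here is a rough sketch of falling back from MPI to Gloo. It assumes a recent PyTorch release, where `torch.distributed` provides `is_mpi_available()`, `is_gloo_available()` and `is_nccl_available()`. The two helper names are my own.

```python
def choose_backend(preferred, available):
    """Return the first backend name in `preferred` that is also in `available`."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"none of {list(preferred)} is available")


def available_backends():
    """Ask the installed PyTorch build which backends it was compiled with."""
    # Imported here so choose_backend can be used without PyTorch installed.
    import torch.distributed as dist

    checks = {
        "mpi": dist.is_mpi_available,
        "gloo": dist.is_gloo_available,
        "nccl": dist.is_nccl_available,
    }
    return {name for name, is_available in checks.items() if is_available()}
```

With the release binaries, `choose_backend(["mpi", "gloo"], available_backends())` would be expected to pick `"gloo"`, since MPI is only there after building from source.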
