```
MakeFiles/c10.dir/util/numa.cpp.o -c …/c10/util/numa.cpp
…/c10/util/numa.cpp:6:10: fatal error: numa.h: No such file or directory
 #include <numa.h>
          ^~~~~~~~
compilation terminated.
[1684/4092] Building CXX object third_p…akeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 740, in <module>
    build_deps()
  File "setup.py", line 323, in build_deps
    cmake=cmake)
  File "/cluster/home/cnphuong/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 340, in build
    self.run(build_args, my_env)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '64']' returned non-zero exit status 1.
```
Hey @ph0123, DistributedDataParallel with a CPU model should be supported by default in the release binaries. You can enable this mode by passing in a CPU model and not providing a `device_ids` argument. If this is all you need, you don't need to compile from source, I think? Did you hit any error when trying to run DDP with CPU models?
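To illustrate the reply above, here is a minimal sketch of CPU-only DDP using the `gloo` backend from the release binaries. It runs as a single process (`world_size=1`) for simplicity; the function name `run_cpu_ddp`, the toy `nn.Linear` model, and the rendezvous address/port are illustrative choices, not anything from the thread. Note that `DDP(model)` is constructed without `device_ids`, exactly as suggested.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_cpu_ddp(rank: int = 0, world_size: int = 1) -> float:
    # Single-process sketch; a real job would launch one process per rank
    # (e.g. via torch.multiprocessing.spawn or torchrun).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative rendezvous
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 5)   # plain CPU model
    ddp_model = DDP(model)     # no device_ids argument for CPU training

    loss = ddp_model(torch.randn(8, 10)).sum()
    loss.backward()            # gradients are all-reduced across ranks here

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    run_cpu_ddp()
```

With more than one process, each rank would call `run_cpu_ddp(rank, world_size)` and `gloo` would average the gradients during `backward()`.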
Thanks. Last time I installed from source, and I hit some errors when running the program with distributed parallel. I checked the tutorial, and it said I had to install from source to run distributed training on CPUs.
But now that I have read the documents again, it seems installing from source is not needed.
**MPI Backend**
The Message Passing Interface (MPI) is a standardized tool from the field of high-performance computing. It supports point-to-point and collective communication and was the main inspiration for the API of `torch.distributed`. Several implementations of MPI exist (e.g. [Open-MPI](https://www.open-mpi.org/), [MVAPICH2](http://mvapich.cse.ohio-state.edu/), [Intel MPI](https://software.intel.com/en-us/intel-mpi-library)), each optimized for different purposes. The advantage of using the MPI backend lies in MPI's wide availability - and high level of optimization - on large computer clusters. [Some](https://developer.nvidia.com/mvapich) [recent](https://developer.nvidia.com/ibm-spectrum-mpi) [implementations](https://www.open-mpi.org/) are also able to take advantage of CUDA IPC and GPU Direct technologies in order to avoid memory copies through the CPU.
Unfortunately, PyTorch's binaries cannot include an MPI implementation, so we'll have to recompile PyTorch by hand. Fortunately, this process is fairly simple: upon compilation, PyTorch will look *by itself* for an available MPI implementation. The following steps install the MPI backend by installing PyTorch [from source](https://github.com/pytorch/pytorch#from-source).
1. Create and activate your Anaconda environment, install all the pre-requisites following [the guide](https://github.com/pytorch/pytorch#from-source), but do **not** run `python setup.py install` yet.
2. Choose and install your favorite MPI implementation. Note that enabling CUDA-aware MPI might require some additional steps. In our case, we’ll stick to Open-MPI *without* GPU support: `conda install -c conda-forge openmpi`
3. Now, go to your cloned PyTorch repo and execute `python setup.py install` .
In order to test our newly installed backend, a few modifications are required.
1. Replace the content under `if __name__ == '__main__':` with `init_process(0, 0, run, backend='mpi')` .
2. Run `mpirun -n 4 python myscript.py` .
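The two modifications above can be sketched as follows. This assumes the tutorial's `run` function (shown here as a placeholder all-reduce) and its `init_process` helper; with the MPI backend, the `rank` and `size` values passed in are ignored, since `mpirun` provides them.

```python
import torch
import torch.distributed as dist

def run(rank, size):
    # Placeholder for the tutorial's collective-communication logic:
    # every rank contributes a 1, so the sum equals the world size.
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {rank} has {tensor[0]}")

def init_process(rank, size, fn, backend="mpi"):
    # With backend="mpi", rank and world size come from mpirun,
    # so the rank/size arguments passed here are superfluous.
    dist.init_process_group(backend)
    fn(dist.get_rank(), dist.get_world_size())

if __name__ == "__main__":
    init_process(0, 0, run, backend="mpi")
```

Launching with `mpirun -n 4 python myscript.py` spawns four processes, each of which joins the group and learns its own rank and world size from MPI.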
The reason for these changes is that MPI needs to create its own environment before spawning the processes. MPI will also spawn its own processes and perform the handshake described in [Initialization Methods](https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods), making the `rank` and `size` arguments of `init_process_group` superfluous. This is actually quite powerful, as you can pass additional arguments to `mpirun` in order to tailor computational resources for each process (things like the number of cores per process, hand-assigning machines to specific ranks, and [some more](https://www.open-mpi.org/faq/?category=running#mpirun-hostfile)). Doing so, you should obtain the same familiar output as with the other communication backends.