Hey guys, I am experimenting with MPI and mpi4py to parallelize a basic neural network across multiple processes on a single Linux machine.
I understand that PyTorch has to be compiled from source to use the MPI backend, so I followed the guidelines, but when running
python setup.py install
I get the following terminal error and I cannot figure out a solution:
Building wheel torch-2.2.0a0+git3cf5348
-- Building version 2.2.0a0+git3cf5348
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/rohitkumar/github_code/distributed-neural-network/pytorch/torch -DCMAKE_PREFIX_PATH=/home/rohitkumar/.conda/envs/december/lib/python3.10/site-packages;/home/rohitkumar/.conda/envs/december -DGLIBCXX_USE_CXX11_ABI=1 -DNUMPY_INCLUDE_DIR=/home/rohitkumar/.conda/envs/december/lib/python3.10/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/home/rohitkumar/.conda/envs/december/bin/python3 -DPYTHON_INCLUDE_DIR=/home/rohitkumar/.conda/envs/december/include/python3.10 -DPYTHON_LIBRARY=/home/rohitkumar/.conda/envs/december/lib/libpython3.10.a -DTORCH_BUILD_VERSION=2.2.0a0+git3cf5348 -DUSE_NUMPY=True /home/rohitkumar/github_code/distributed-neural-network/pytorch
CMake Error: Error: generator : Ninja
Does not match the generator used previously: Unix Makefiles
Either remove the CMakeCache.txt file and CMakeFiles directory or choose a different binary directory.
When checking the cloned "pytorch" directory, I cannot find any CMakeCache.txt file or any CMakeFiles directory.
CMake Warning at CMakeLists.txt:36 (message):
C++ standard version definition detected in environment variable.PyTorch
requires -std=c++17. Please remove -std=c++ settings in your environment.
-- /home/rohitkumar/.conda/envs/december/bin/x86_64-conda-linux-gnu-g++ /home/rohitkumar/github_code/distributed-neural-network/pytorch/torch/abi-check.cpp -o /home/rohitkumar/github_code/distributed-neural-network/pytorch/build/abi-check
/home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: /home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/lib/../lib64/libstdc++.so: undefined reference to `memcpy@GLIBC_2.14'
/home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: /home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/lib/../lib64/libstdc++.so: undefined reference to `aligned_alloc@GLIBC_2.16'
/home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: /home/rohitkumar/.conda/envs/december/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/lib/../lib64/libstdc++.so: undefined reference to `clock_gettime@GLIBC_2.17'
collect2: error: ld returned 1 exit status
CMake Error at cmake/CheckAbi.cmake:16 (message):
Could not compile ABI Check: 1
Call Stack (most recent call first):
CMakeLists.txt:52 (include)
I tried adding the following flags to the cmake command:
Regardless, I tried cleaning the build environment and reran python setup.py install, which now gives me this error (this is only the end of the log): gist:24576bfca6f8b44bb87d7e21e7eacd2b · GitHub
Please tell me if there is anything else I can try; I am clueless. Thanks!
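For reference, this is the kind of minimal script I am ultimately hoping to run once the MPI-backed build works (just a sketch; the script name and the toy all-reduce are placeholders):

```python
# Sketch of the intended MPI-backend usage (placeholder script, e.g. train_mpi.py),
# launched with something like:  mpirun -np 4 python train_mpi.py
import torch
import torch.distributed as dist

def main():
    # With the MPI backend, rank and world size come from the MPI launcher,
    # so no init_method or environment variables are needed here.
    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Toy all-reduce just to verify the backend works.
    t = torch.ones(1) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```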
Hey, I'm also facing the "aten/src/ATen/UfuncCPUKernel_add.cpp.DEFAULT.cpp.o [-w dupbuild=err]" issue. Did you manage to find a solution @rohitdat @ptrblck?
When I build PyTorch inside the pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel image (I cloned v2.1.0), it gives me the following error:
--
-- ******** Summary ********
-- General:
-- CMake version : 3.26.4
-- CMake command : /opt/conda/envs/nano_paging/bin/cmake
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- C++ compiler id : GNU
-- C++ compiler version : 9.4.0
-- Using ccache if found : ON
-- Found ccache : CCACHE_PROGRAM-NOTFOUND
-- CXX flags : -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
-- Build type : Release
-- Compile definitions : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;ONNX_NAMESPACE=onnx_torch;IDEEP_USE_MKL;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;USE_EXTERNAL_MZCRC;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS;BUILD_NVFUSER
-- CMAKE_PREFIX_PATH : /opt/conda/envs/nano_paging/lib/python3.10/site-packages;/opt/conda/envs/nano_paging;/usr/local/cuda;/usr/local/cuda
-- CMAKE_INSTALL_PREFIX : /ramyapra/pytorch/torch
-- USE_GOLD_LINKER : OFF
--
-- TORCH_VERSION : 2.1.0
-- BUILD_CAFFE2 : 0
-- BUILD_CAFFE2_OPS : OFF
-- BUILD_STATIC_RUNTIME_BENCHMARK: OFF
-- BUILD_TENSOREXPR_BENCHMARK: OFF
-- BUILD_NVFUSER_BENCHMARK: OFF
-- BUILD_BINARY : OFF
-- BUILD_CUSTOM_PROTOBUF : ON
-- Link local protobuf : ON
-- BUILD_DOCS : OFF
-- BUILD_PYTHON : True
-- Python version : 3.10.13
-- Python executable : /opt/conda/envs/nano_paging/bin/python
-- Pythonlibs version : 3.10.13
-- Python library : /opt/conda/envs/nano_paging/lib/libpython3.10.a
-- Python includes : /opt/conda/envs/nano_paging/include/python3.10
-- Python site-packages: lib/python3.10/site-packages
-- BUILD_SHARED_LIBS : ON
-- CAFFE2_USE_MSVC_STATIC_RUNTIME : OFF
-- BUILD_TEST : False
-- BUILD_JNI : OFF
-- BUILD_MOBILE_AUTOGRAD : OFF
-- BUILD_LITE_INTERPRETER: OFF
-- INTERN_BUILD_MOBILE :
-- TRACING_BASED : OFF
-- USE_BLAS : 1
-- BLAS : mkl
-- BLAS_HAS_SBGEMM :
-- USE_LAPACK : 1
-- LAPACK : mkl
-- USE_ASAN : OFF
-- USE_TSAN : OFF
-- USE_CPP_CODE_COVERAGE : OFF
-- USE_CUDA : ON
-- Split CUDA :
-- CUDA static link : OFF
-- USE_CUDNN : ON
-- USE_EXPERIMENTAL_CUDNN_V8_API: ON
-- USE_CUSPARSELT : OFF
-- CUDA version : 12.1
-- USE_FLASH_ATTENTION : ON
-- cuDNN version : 8.9.0
-- CUDA root directory : /usr/local/cuda
-- CUDA library : /usr/lib/x86_64-linux-gnu/libcuda.so
-- cudart library : /usr/local/cuda/lib64/libcudart.so
-- cublas library : /usr/local/cuda/lib64/libcublas.so
-- cufft library : /usr/local/cuda/lib64/libcufft.so
-- curand library : /usr/local/cuda/lib64/libcurand.so
-- cusparse library : /usr/local/cuda/lib64/libcusparse.so
-- cuDNN library : /usr/lib/x86_64-linux-gnu/libcudnn.so
-- nvrtc : /usr/local/cuda/lib64/libnvrtc.so
-- CUDA include path : /usr/local/cuda/include
-- NVCC executable : /usr/local/cuda/bin/nvcc
-- CUDA compiler : /usr/local/cuda/bin/nvcc
-- CUDA flags : -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_80,code=sm_80 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__
-- CUDA host compiler :
-- CUDA --device-c : OFF
-- USE_TENSORRT : OFF
-- USE_ROCM : OFF
-- BUILD_NVFUSER : ON
-- USE_EIGEN_FOR_BLAS :
-- USE_FBGEMM : ON
-- USE_FAKELOWP : OFF
-- USE_KINETO : ON
-- USE_FFMPEG : OFF
-- USE_GFLAGS : OFF
-- USE_GLOG : OFF
-- USE_LEVELDB : OFF
-- USE_LITE_PROTO : OFF
-- USE_LMDB : OFF
-- USE_METAL : OFF
-- USE_PYTORCH_METAL : OFF
-- USE_PYTORCH_METAL_EXPORT : OFF
-- USE_MPS : OFF
-- USE_FFTW : OFF
-- USE_MKL : ON
-- USE_MKLDNN : ON
-- USE_MKLDNN_ACL : OFF
-- USE_MKLDNN_CBLAS : OFF
-- USE_UCC : OFF
-- USE_ITT : ON
-- USE_NCCL : ON
-- USE_SYSTEM_NCCL : OFF
-- USE_NCCL_WITH_UCC : OFF
-- USE_NNPACK : ON
-- USE_NUMPY : ON
-- USE_OBSERVERS : ON
-- USE_OPENCL : OFF
-- USE_OPENCV : OFF
-- USE_OPENMP : ON
-- USE_TBB : OFF
-- USE_MIMALLOC : OFF
-- USE_VULKAN : OFF
-- USE_PROF : OFF
-- USE_QNNPACK : ON
-- USE_PYTORCH_QNNPACK : ON
-- USE_XNNPACK : ON
-- USE_REDIS : OFF
-- USE_ROCKSDB : OFF
-- USE_ZMQ : OFF
-- USE_DISTRIBUTED : ON
-- USE_MPI : OFF
-- USE_GLOO : ON
-- USE_GLOO_WITH_OPENSSL : OFF
-- USE_TENSORPIPE : ON
-- Public Dependencies : caffe2::mkl
-- Private Dependencies : Threads::Threads;pthreadpool;cpuinfo;qnnpack;pytorch_qnnpack;nnpack;XNNPACK;fbgemm;ittnotify;fp16;caffe2::openmp;tensorpipe;gloo;foxi_loader;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl
-- Public CUDA Deps. : caffe2::cufft;caffe2::curand;caffe2::cublas
-- Private CUDA Deps. : torch::cudnn;__caffe2_nccl;tensorpipe_cuda;gloo_cuda;/usr/local/cuda/lib64/libcudart.so;CUDA::cusparse;CUDA::curand;CUDA::cufft;ATEN_CUDA_FILES_GEN_LIB
-- USE_COREML_DELEGATE : OFF
-- BUILD_LAZY_TS_BACKEND : ON
-- TORCH_DISABLE_GPU_ASSERTS : ON
-- Performing Test HAS_WMISSING_PROTOTYPES
-- Performing Test HAS_WMISSING_PROTOTYPES - Failed
-- Performing Test HAS_WERROR_MISSING_PROTOTYPES
-- Performing Test HAS_WERROR_MISSING_PROTOTYPES - Failed
-- Configuring done (68.2s)
-- Generating done (3.2s)
CMake Error:
Running
'/opt/conda/envs/nano_paging/bin/ninja' '-C' '/ramyapra/pytorch/build' '-t' 'recompact'
failed with:
ninja: error: build.ninja:57464: multiple rules generate caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.DEFAULT.cpp.o [-w dupbuild=err]
I looked at the compile_commands.json file that gets generated, and there are indeed two sets of commands that build this object file, but I cannot figure out why that is happening.
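In case it helps with debugging, this is roughly how I checked for the duplicated entries (a quick sketch; the path assumes the default build directory):

```python
# List how many compile commands reference the offending generated source
# (assumes compile_commands.json sits in the default "build" directory).
import json
from collections import Counter

with open("build/compile_commands.json") as f:
    entries = json.load(f)

counts = Counter(e["file"] for e in entries)
for path, n in counts.items():
    if "UfuncCPUKernel_add" in path and n > 1:
        print(n, path)
```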
Hey, unfortunately after many attempts I gave up on building PyTorch from source.
My goal in building from source was to use MPI as the backend for distributed operations, but I ended up going with the gloo backend instead, which does not require PyTorch to be compiled from source.
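In case it is useful to anyone landing here, this is roughly the gloo-based setup I ended up with (a minimal sketch using the stock wheel; the model, data, and script name are placeholders):

```python
# Gloo fallback with DistributedDataParallel on CPU, using the stock PyTorch wheel.
# Launched with e.g.:  torchrun --nproc_per_node=4 train_gloo.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so the default env:// init method picks them up automatically.
    dist.init_process_group(backend="gloo")

    model = DDP(nn.Linear(10, 1))                 # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(8, 10), torch.randn(8, 1)  # placeholder data
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                               # DDP all-reduces the gradients
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```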
The PyTorch wheel is compiled in the docker image manylinux:native-manylinux-builder-cpu-main. Because there is no OpenMPI in that image, the wheel is built with USE_MPI:BOOL=OFF. If you run into this problem, I suggest raising an issue with the community asking them to support USE_MPI:BOOL=ON, so that you don't need to build PyTorch from source.
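You can confirm what a given install was built with using the standard availability helpers (a small sketch):

```python
# Check which distributed backends the installed torch build supports.
import torch
import torch.distributed as dist

print("torch", torch.__version__)
print("MPI  available:", dist.is_mpi_available())   # False for the official wheels
print("gloo available:", dist.is_gloo_available())
print("NCCL available:", dist.is_nccl_available())
```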