Hello,
I am trying to run a 3D U-Net from MinkowskiEngine using DistributedDataParallel, running from a Singularity container with PyTorch 1.6 installed via conda. The cluster I am using has P100, V100, and K80 GPUs.
- 1 GPU, no parallelization: Model works on all GPU types
- DataParallel with 1 node, 4 GPUs: Model works on all GPU types
- DistributedDataParallel with 1 node, 4 GPU: Model only works on P100 and V100, but fails on K80 with the following error:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f7df16d977d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f7df1929d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7df16c5b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x5c (0x7f7dfc36669c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
...
I tried again using Horovod (with NCCL, InfiniBand libraries, etc. inside the Singularity container):
- DistributedDataParallel with 1 node, 4 GPU: Model works on all GPU types
- DistributedDataParallel with 2 nodes, 4 K80 GPUs each (cluster only supports multiple nodes with K80s): Fails with the same error as above.
According to https://github.com/pytorch/pytorch/issues/13541, it has to do with GLIBCXX_USE_CXX11_ABI set to 1. How can I change it to 0?
I am now building PyTorch from source, but can't seem to change it to 0. Here is what I have tried:
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout tags/v1.6.0
git submodule update --init --recursive
- Set TORCH_CXX_FLAGS as an environment variable:
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" TORCH_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" \
MAX_JOBS=16 python setup.py install
- Modify TorchConfig.cmake.in, comment out ABI line (Issue13541 #issuecomment-512756771)
sed -i 's/set(TORCH_CXX_FLAGS/#set(TORCH_CXX_FLAGS/g' cmake/TorchConfig.cmake.in
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
MAX_JOBS=16 python setup.py install
- Modify TorchConfig.cmake.in, set to 0:
sed -i 's/@GLIBCXX_USE_CXX11_ABI@/0/g' cmake/TorchConfig.cmake.in
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
MAX_JOBS=16 python setup.py install
- Prepend a line to torch/CMakeLists.txt:
echo 'set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=0")' > no_abi.txt
cat no_abi.txt torch/CMakeLists.txt > torch/CMakeLists.txt.1
rm torch/CMakeLists.txt
mv torch/CMakeLists.txt.1 torch/CMakeLists.txt
rm no_abi.txt
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
MAX_JOBS=16 python setup.py install
- Combinations of the first, third, and fourth approaches above.
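To see whether any of these attempts actually reached the configure step, one option is to look for the ABI define in the CMake cache that the build leaves behind. This is only a rough sketch: the path build/CMakeCache.txt is an assumption about where setup.py puts the cache in the checkout, and find_abi_lines is a hypothetical helper name, not part of PyTorch.

```python
from pathlib import Path

# Substring shared by the CMake variable and the -D_GLIBCXX_USE_CXX11_ABI=... define
FLAG = "GLIBCXX_USE_CXX11_ABI"


def find_abi_lines(text: str) -> list:
    """Return non-comment lines of a CMake cache that mention the ABI flag."""
    hits = []
    for line in text.splitlines():
        stripped = line.strip()
        # CMakeCache.txt comments start with '//' or '#'
        if stripped.startswith(("//", "#")):
            continue
        if FLAG in stripped:
            hits.append(stripped)
    return hits


# Assumed location of the cache, relative to the pytorch checkout
cache = Path("build/CMakeCache.txt")
if cache.is_file():
    for hit in find_abi_lines(cache.read_text(errors="replace")):
        print(hit)
else:
    print("build/CMakeCache.txt not found; run this from the pytorch checkout")
```

If the cache still shows the define set to 1 after a rebuild, the change probably never reached CMake, which would match the unchanged output of the check.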
Finally, I check the result by running
python -c "import torch; print('PYTORCH USING CXX11_ABI =', torch._C._GLIBCXX_USE_CXX11_ABI )"
but it always outputs:
PYTORCH USING CXX11_ABI = True
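For reference, the same check can be written with the public helper torch.compiled_with_cxx11_abi(), and printing torch.__file__ alongside it shows which install is actually being imported (for example, an older conda package rather than the fresh source build). This is a minimal sketch; pytorch_cxx11_abi is just an illustrative function name.

```python
import importlib.util


def pytorch_cxx11_abi():
    """Return (abi_flag, install_path), or (None, None) if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None, None
    import torch
    # compiled_with_cxx11_abi() is the public wrapper around
    # torch._C._GLIBCXX_USE_CXX11_ABI
    return bool(torch.compiled_with_cxx11_abi()), torch.__file__


flag, path = pytorch_cxx11_abi()
print("PYTORCH USING CXX11_ABI =", flag)
print("imported from:", path)
```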
I am also now getting an error when importing MinkowskiEngine:
ImportError: /opt/conda/lib/python3.7/site-packages/MinkowskiEngine-0.4.3-py3.7-linux-x86_64.egg/MinkowskiEngineBackend.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
which also has to do with GLIBCXX_USE_CXX11_ABI (Cannot build pybind11/libtorch code with cmake)
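The mangled name itself gives a hint about which side uses which ABI: the substring St7__cxx11 is the mangled std::__cxx11 namespace, so the MinkowskiEngine extension is asking for a c10::Error constructor that takes a new-ABI std::string. A small sketch of that check follows; uses_cxx11_abi is a hypothetical helper, and it only inspects the symbol text, not the libraries.

```python
def uses_cxx11_abi(mangled: str) -> bool:
    """Return True if a mangled C++ symbol refers to the std::__cxx11 namespace."""
    # 'St7__cxx11' is how the Itanium ABI mangles std::__cxx11,
    # the namespace holding the new-ABI std::string
    return "St7__cxx11" in mangled


# Undefined symbol copied from the ImportError above
symbol = (
    "_ZN3c105ErrorC1ENS_14SourceLocationE"
    "NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
)
print(uses_cxx11_abi(symbol))
```

For this symbol the check prints True, meaning the extension was compiled against the new ABI and would likely need to be rebuilt with the same setting as the PyTorch libraries it loads.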
If you have any suggestion about rebuilding pytorch with -D_GLIBCXX_USE_CXX11_ABI=0, I would really appreciate it.
Here is my Singularity definition file in case it is helpful: github.com/edraizen/SingularityTorch/blob/rivana_pytorch/Singularity
Thanks for your help!