Setting GLIBCXX_USE_CXX11_ABI to 0

Eli_Draizen · October 2, 2020, 6:00pm

Hello,

I am trying to run a 3D U-Net from MinkowskiEngine using DistributedDataParallel, from inside a Singularity container with PyTorch 1.6 installed via conda. The cluster I am using has P100, V100, and K80 GPUs.

1 GPU, no parallelization: Model works on all GPU types
DataParallel with 1 node, 4 GPUs: Model works on all GPU types
DistributedDataParallel with 1 node, 4 GPUs: Model only works on P100 and V100, but fails on K80 with the following error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f7df16d977d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f7df1929d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7df16c5b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x5c (0x7f7dfc36669c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
...

I tried again using Horovod (with NCCL, InfiniBand libraries, etc. inside the Singularity container):

DistributedDataParallel with 1 node, 4 GPU: Model works on all GPU types
DistributedDataParallel with 2 nodes, 4 K80 GPUs each (cluster only supports multiple nodes with K80s): Fails with the same error as above.

According to https://github.com/pytorch/pytorch/issues/13541, it has to do with GLIBCXX_USE_CXX11_ABI set to 1. How can I change it to 0?

I am now building PyTorch from source, but can't seem to change it to 0. Here is what I have tried:

git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout tags/v1.6.0
git submodule update --init --recursive

1. Set TORCH_CXX_FLAGS as an environment variable:

TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" TORCH_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" \
    MAX_JOBS=16 python setup.py install

2. Modify TorchConfig.cmake.in to comment out the ABI line (Issue13541 #issuecomment-512756771):

sed -i 's/set(TORCH_CXX_FLAGS/#set(TORCH_CXX_FLAGS/g' cmake/TorchConfig.cmake.in
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    MAX_JOBS=16 python setup.py install

3. Modify TorchConfig.cmake.in to set the ABI value to 0:

sed -i 's/@GLIBCXX_USE_CXX11_ABI@/0/g' cmake/TorchConfig.cmake.in
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    MAX_JOBS=16 python setup.py install

4. Prepend a line to torch/CMakeLists.txt:

echo 'set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=0")' > no_abi.txt
cat no_abi.txt torch/CMakeLists.txt > torch/CMakeLists.txt.1
rm torch/CMakeLists.txt
mv torch/CMakeLists.txt.1 torch/CMakeLists.txt
rm no_abi.txt
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    MAX_JOBS=16 python setup.py install

5. Combinations of 1, 3, and 4.

After each build, I check the result by running:
python -c "import torch; print('PYTORCH USING CXX11_ABI =', torch._C._GLIBCXX_USE_CXX11_ABI )"
but it always prints:
PYTORCH USING CXX11_ABI = True
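
For reference, the same check can be written without reaching into the private torch._C module. This is only a minimal sketch: it assumes the installed PyTorch provides the public helper torch.compiled_with_cxx11_abi(), the function name report_cxx11_abi is made up for illustration, and it returns None when torch cannot be imported at all.

```python
def report_cxx11_abi():
    """Return the ABI flag torch was built with (True/False), or None if torch is missing."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return None
    # Public wrapper around torch._C._GLIBCXX_USE_CXX11_ABI
    return bool(torch.compiled_with_cxx11_abi())


if __name__ == "__main__":
    print("PYTORCH USING CXX11_ABI =", report_cxx11_abi())
```

Running it in the same conda environment that the build installed into should rule out the case where an older torch from another location is the one being imported.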

I am also now getting an error when importing MinkowskiEngine:

ImportError: /opt/conda/lib/python3.7/site-packages/MinkowskiEngine-0.4.3-py3.7-linux-x86_64.egg/MinkowskiEngineBackend.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

which is also related to GLIBCXX_USE_CXX11_ABI (see the thread "Cannot build pybind11/libtorch code with cmake").
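
The mangled name in that ImportError can itself be read for ABI information. libstdc++ puts the new-ABI string type in the inline namespace std::__cxx11, so a symbol that takes a std::string and was compiled with _GLIBCXX_USE_CXX11_ABI=1 carries the substring __cxx11. A minimal sketch of that check follows; the helper name uses_cxx11_abi is hypothetical, and the test says nothing about symbols that do not involve the affected types (such as std::string or std::list).

```python
def uses_cxx11_abi(symbol: str) -> bool:
    """True if a mangled or demangled C++ symbol references std::__cxx11 types."""
    return "__cxx11" in symbol


# Symbol reported as undefined when importing MinkowskiEngine
missing = ("_ZN3c105ErrorC1ENS_14SourceLocationE"
           "NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE")
# Demangled frame #0 from the crash trace in the first part of this post
old_style = "c10::Error::Error(c10::SourceLocation, std::string)"

print(uses_cxx11_abi(missing))    # True
print(uses_cxx11_abi(old_style))  # False
```

Here the missing symbol is a new-ABI one, which suggests that the extension was compiled with the flag set to 1 and is looking for a c10::Error constructor that the loaded libc10.so does not export.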

If you have any suggestions about rebuilding PyTorch with -D_GLIBCXX_USE_CXX11_ABI=0, I would really appreciate them.

Here is my Singularity definition file in case it is helpful: github.com/edraizen/SingularityTorch/blob/rivana_pytorch/Singularity

Thanks for your help!

yselivonchyk · February 18, 2021, 7:01pm

I am not sure which of these made the difference, but some combination of the following worked:

export GLIBCXX_USE_CXX11_ABI=0
export CXXFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"
export TORCH_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"
#export DEBUG=1
export USE_GLOO=0
export REL_WITH_DEB_INFO=1
export USE_DISTRIBUTED=0

With these set, the check
python -c "import torch; print('PYTORCH USING CXX11_ABI =', torch._C._GLIBCXX_USE_CXX11_ABI )"
prints False.
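
For anyone who wants to drive the build from a script, the exports above can be collected into an environment dictionary. This is a rough sketch rather than a tested recipe: the helper name abi0_build_env is invented for illustration, and, like the export lines, it overwrites CXXFLAGS instead of appending to it.

```python
import os

ABI0_FLAG = "-D_GLIBCXX_USE_CXX11_ABI=0"


def abi0_build_env(base=None):
    """Return a copy of the environment with the old-ABI build variables set."""
    env = dict(os.environ if base is None else base)
    env.update({
        "GLIBCXX_USE_CXX11_ABI": "0",
        "CXXFLAGS": ABI0_FLAG,
        "TORCH_CXX_FLAGS": ABI0_FLAG,
        "USE_GLOO": "0",
        "REL_WITH_DEB_INFO": "1",
        "USE_DISTRIBUTED": "0",
    })
    return env


if __name__ == "__main__":
    # From a PyTorch checkout, the build could then be started with, e.g.:
    #   subprocess.run([sys.executable, "setup.py", "install"],
    #                  env=abi0_build_env(), check=True)
    env = abi0_build_env()
    for name in ("GLIBCXX_USE_CXX11_ABI", "CXXFLAGS", "TORCH_CXX_FLAGS"):
        print(name, "=", env[name])
```

Note that USE_DISTRIBUTED=0 builds PyTorch without torch.distributed, so this exact set of variables would not cover the DistributedDataParallel use case from the first post.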

jie.hang · September 26, 2021, 4:21am

@yselivonchyk Eugene
I tried to compile PyTorch 1.8.1 from source the same way you did, but I get undefined reference errors:

/home/jhang/bi/hj/pytorch_ori_v1.8/build/lib/libtorch_cuda.so: undefined reference to `gloo::getCudaPCIBusID(int)'
/home/jhang/bi/hj/pytorch_ori_v1.8/build/lib/libtorch_cuda.so: undefined reference to `gloo::EnforceNotMet::EnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: error: ld returned 1 exit status

jie.hang · September 26, 2021, 4:23am

@Eli_Draizen
I added add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0) to CMakeLists.txt and recompiled; torch._C._GLIBCXX_USE_CXX11_ABI is now False.