Unable to use MPI rendezvous in Caffe2

gyani91 · August 16, 2018, 2:36pm

I have been working with Caffe2 for 6 weeks now. I have been stuck on one issue for the past 25 days; I have searched the internet far and wide and tried several things without success.

The issue in a single line: Unable to use MPI rendezvous in Caffe2

Environment: Cray XC40/XC50 supercomputer, which uses SLURM.

Details:
For reproducibility, I am using a container built from the following Dockerfile:

FROM nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
LABEL maintainer="aaronmarkham@fb.com"

# caffe2 install with gpu support


RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    libgflags-dev \
    libgoogle-glog-dev \
    libgtest-dev \
    libiomp-dev \
    libleveldb-dev \
    liblmdb-dev \
    libopencv-dev \
    libprotobuf-dev \
    libsnappy-dev \
    protobuf-compiler \
    python-dev \
    python-numpy \
    python-pip \
    python-pydot \
    python-setuptools \
    python-scipy \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN wget -q http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz \
    && tar xf mpich-3.1.4.tar.gz \
    && cd mpich-3.1.4 \
    && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf mpich-3.1.4 \
    && rm mpich-3.1.4.tar.gz

RUN pip install --no-cache-dir --upgrade pip==9.0.3 setuptools wheel
RUN pip install --no-cache-dir \
    flask \
    future \
    graphviz \
    hypothesis \
    jupyter \
    matplotlib \
    numpy \
    protobuf \
    pydot \
    python-nvd3 \
    pyyaml \
    requests \
    scikit-image \
    scipy \
    setuptools \
    six \
    tornado

########## INSTALLATION STEPS ###################
RUN git clone --branch master --recursive https://github.com/pytorch/pytorch.git
RUN cd pytorch && mkdir build && cd build \
    && cmake .. \
    -DCUDA_ARCH_NAME=Manual \
    -DCUDA_ARCH_BIN="35 52 60 61" \
    -DCUDA_ARCH_PTX="61" \
    -DUSE_NNPACK=OFF \
    -DUSE_ROCKSDB=OFF \
    && make -j"$(nproc)" install \
    && ldconfig \
    && make clean \
    && cd .. \
    && rm -rf build

ENV PYTHONPATH /usr/local

The command:

srun -N 4 -n 4 -C gpu \
shifter run --mpi load/library/caffe2_container_diff \
python resnet50_trainer.py \
--train_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train \
--test_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val \
--db_type=lmdb \
--num_shards=4 \
--num_gpu=1 \
--num_labels=2 \
--batch_size=2 \
--epoch_size=150 \
--num_epochs=2 \
--distributed_transport ibverbs \
--distributed_interface mlx5_0
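
The command above passes --num_shards=4 but no per-task shard index, so every task may end up with the same one. As a minimal sketch, assuming the standard SLURM variables SLURM_PROCID and SLURM_NTASKS are visible inside the container, each task could derive its own shard index like this. The helper name slurm_shard_info is hypothetical and not part of Caffe2, and it is also an assumption that the trainer script accepts a --shard_id argument to receive the value.

```python
import os


def slurm_shard_info(env=None):
    """Return (shard_id, num_shards) for the current SLURM task.

    Assumes srun exported SLURM_PROCID (0-based task rank) and
    SLURM_NTASKS (total number of tasks). Falls back to a single
    shard when the variables are absent.
    """
    env = os.environ if env is None else env
    shard_id = int(env.get("SLURM_PROCID", 0))
    num_shards = int(env.get("SLURM_NTASKS", 1))
    # Guard against a rank that does not fit the reported task count.
    if not 0 <= shard_id < num_shards:
        raise ValueError(
            "inconsistent SLURM environment: task %d of %d"
            % (shard_id, num_shards)
        )
    return shard_id, num_shards
```

Calling slurm_shard_info() once at the start of each task and logging the result would show whether the four tasks really see four distinct ranks.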

The output/error:

srun: job 9059937 queued and waiting for resources
srun: job 9059937 has been allocated resources
E0816 14:14:20.081552  7042 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081637  7042 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081642  7042 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083420  6442 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083504  6442 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083509  6442 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
E0816 14:14:20.087043  5987 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087126  5987 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087131  5987 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
E0816 14:14:20.102372 11086 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102452 11086 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102457 11086 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** SIGSEGV (@0x8) received by PID 5987 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** SIGSEGV (@0x8) received by PID 7042 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
*** SIGSEGV (@0x8) received by PID 6442 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
*** SIGSEGV (@0x8) received by PID 11086 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @                0x0 (unknown)
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaaaace4390 (unknown)
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @                0x0 (unknown)
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @                0x0 (unknown)
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @                0x0 (unknown)
srun: error: nid06499: task 2: Segmentation fault
srun: Terminating job step 9059937.0
srun: error: nid06497: task 0: Segmentation fault
srun: error: nid06498: task 1: Segmentation fault
srun: error: nid06500: task 3: Segmentation fault
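
Because all four tasks write to the same stream, their stack traces are interleaved and hard to read. The following is a rough sketch of a helper that pulls out which PIDs received a signal and which function was at the faulting program counter. The name summarize_crash is hypothetical, and the patterns are based only on the line formats visible in the output above.

```python
import re

# Matches lines such as:
# *** SIGSEGV (@0x8) received by PID 5987 (TID 0x...) from PID 8; stack trace: ***
SIGNAL_RE = re.compile(
    r"\*\*\* (SIG[A-Z]+) \(@0x[0-9a-f]+\) received by PID (\d+)"
)
# Matches faulting-frame lines such as:
# PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
PC_RE = re.compile(r"^PC: @\s+0x[0-9a-f]+\s+(.+)$")


def summarize_crash(lines):
    """Return ({signal: set of PIDs}, set of faulting function names)."""
    pids = {}
    faulting = set()
    for line in lines:
        line = line.strip()
        match = SIGNAL_RE.search(line)
        if match:
            pids.setdefault(match.group(1), set()).add(int(match.group(2)))
            continue
        match = PC_RE.match(line)
        if match:
            faulting.add(match.group(1))
    return pids, faulting
```

Applied to the output above, this should report four PIDs under SIGSEGV (5987, 7042, 6442, 11086) and a single faulting function, caffe2::ConvPoolOpBase<>::TensorInferenceForConv(), which suggests all tasks fail at the same point during shape inference.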

I understand that this information may not be sufficient to diagnose the problem. Please tell me which additional steps I should perform to gather more information about the situation.
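
One check that might be worth running first is whether the Caffe2 build in the container exposes any MPI operators at all. The sketch below filters a list of operator names. The helper name mpi_operators is hypothetical, and it rests on the assumption that MPI-backed operators carry an "MPI" name prefix (for example MPICreateCommonWorld).

```python
def mpi_operators(registered_names):
    """Return the sorted operator names that start with "MPI".

    Assumption: MPI-backed Caffe2 operators use an "MPI" name prefix.
    An empty result would suggest the build lacks MPI support.
    """
    return sorted(
        name for name in registered_names if name.startswith("MPI")
    )
```

Inside the container, the input could come from workspace.RegisteredOperators() in caffe2.python, for example print(mpi_operators(workspace.RegisteredOperators())).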

I am grateful for your help.