Enforce fail: Unable to find interface

I am running into a very frustrating issue involving Detectron2 and a multi-GPU setup in Docker. Training works fine outside Docker, but inside the container I get the following error after the COCO data is loaded and the trainer is constructed. Is there some NCCL setting I'm missing?

Traceback (most recent call last):
  File "perception/isaac_kitti.py", line 367, in <module>
    args=(args,),
  File "/home/scenesearch/src/detectron2/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/scenesearch/src/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/scenesearch/perception/isaac_kitti.py", line 160, in train
    trainer = IsaacKittiTrainer(cfg)
  File "/home/scenesearch/src/detectron2/detectron2/engine/defaults.py", line 284, in __init__
    data_loader = self.build_train_loader(cfg)
  File "/home/scenesearch/src/detectron2/detectron2/engine/defaults.py", line 473, in build_train_loader
    return build_detection_train_loader(cfg)
  File "/home/scenesearch/src/detectron2/detectron2/config/config.py", line 201, in wrapped
    explicit_args = _get_args_from_config(from_config, *args, **kwargs)
  File "/home/scenesearch/src/detectron2/detectron2/config/config.py", line 238, in _get_args_from_config
    ret = from_config_func(*args, **kwargs)
  File "/home/scenesearch/src/detectron2/detectron2/data/build.py", line 327, in _train_loader_from_config
    sampler = TrainingSampler(len(dataset))
  File "/home/scenesearch/src/detectron2/detectron2/data/samplers/distributed_sampler.py", line 37, in __init__
    seed = comm.shared_random_seed()
  File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 230, in shared_random_seed
    all_ints = all_gather(ints)
  File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 154, in all_gather
    group = _get_global_gloo_group()
  File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 89, in _get_global_gloo_group
    return dist.new_group(backend="gloo")
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2508, in new_group
    timeout=timeout)
  File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 592, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1616554800319/work/third_party/gloo/gloo/transport/tcp/device.cc:208] ifa != nullptr. Unable to find interface for: [0.31.32.145]
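
For context, the script uses Detectron2's standard launch() helper to spawn one worker per GPU. Here is a simplified sketch of the entry point (parse_args() and setup_cfg() are placeholders for my own helpers; IsaacKittiTrainer is my DefaultTrainer subclass from the traceback):

from detectron2.engine import launch

def train(args):
    cfg = setup_cfg(args)             # placeholder: builds the detectron2 config
    trainer = IsaacKittiTrainer(cfg)  # crashes here, while building the train loader
    trainer.resume_or_load(resume=False)
    return trainer.train()

if __name__ == "__main__":
    args = parse_args()               # placeholder: argparse wrapper
    launch(
        train,
        num_gpus_per_machine=2,       # two GPUs on this machine
        num_machines=1,
        machine_rank=0,
        dist_url="auto",
        args=(args,),
    )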

I’ve tried a lot of NCCL configurations; the current run has the following set:

export NCCL_SOCKET_IFNAME=eth0; export NCCL_IB_DISABLE=1; export NCCL_DEBUG=info; export NCCL_P2P_DISABLE=1

Below is the NCCL_DEBUG output. Both ranks report Init COMPLETE, so I don’t see anything in it that points to the actual error. There appears to be only one issue about this on the Detectron2 GitHub page, and there the maintainers say it is a DDP problem rather than a Detectron2 concern. I wonder whether this is actually a Docker issue.

2039953:3182:3182 [0] NCCL INFO Bootstrap : Using [0]eth0:100.104.55.225<0>
2039953:3182:3182 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2039953:3182:3182 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
2039953:3182:3182 [0] NCCL INFO NET/Socket : Using [0]eth0:100.104.55.225<0>
2039953:3182:3182 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
2039953:3183:3183 [1] NCCL INFO Bootstrap : Using [0]eth0:100.104.55.225<0>
2039953:3183:3183 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2039953:3183:3183 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
2039953:3183:3183 [1] NCCL INFO NET/Socket : Using [0]eth0:100.104.55.225<0>
2039953:3183:3183 [1] NCCL INFO Using network Socket
2039953:3183:3351 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2039953:3182:3350 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2039953:3182:3350 [0] NCCL INFO Channel 00/02 : 0 1
2039953:3182:3350 [0] NCCL INFO Channel 01/02 : 0 1
2039953:3183:3351 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
2039953:3183:3351 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
2039953:3182:3350 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
2039953:3182:3350 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
2039953:3182:3350 [0] NCCL INFO Channel 00 : 0[60] -> 1[70] via direct shared memory
2039953:3183:3351 [1] NCCL INFO Channel 00 : 1[70] -> 0[60] via direct shared memory
2039953:3182:3350 [0] NCCL INFO Channel 01 : 0[60] -> 1[70] via direct shared memory
2039953:3183:3351 [1] NCCL INFO Channel 01 : 1[70] -> 0[60] via direct shared memory
2039953:3182:3350 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
2039953:3182:3350 [0] NCCL INFO comm 0x7f419c002dd0 rank 0 nranks 2 cudaDev 0 busId 60 - Init COMPLETE
2039953:3183:3351 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
2039953:3182:3182 [0] NCCL INFO Launch mode Parallel
2039953:3183:3351 [1] NCCL INFO comm 0x7f818c002dd0 rank 1 nranks 2 cudaDev 1 busId 70 - Init COMPLETE

Here is my Dockerfile:

FROM nvcr.io/nvidia/pytorch:20.11-py3

USER root
RUN useradd -ms /bin/bash scenesearch

RUN apt-get update
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=America/New_York
RUN apt-get install libgl1-mesa-glx -y
RUN apt-get install ffmpeg libsm6 libxext6  -y
RUN apt-get install -y software-properties-common &&\
        apt-add-repository universe &&\
            apt-get update &&\
            apt-get install -y python3-pip

RUN apt-get install -y libpng16-16 libtiff5 libjpeg-turbo8 wget && rm -rf /var/lib/apt/lists/*

WORKDIR /home/scenesearch
COPY . /home/scenesearch
RUN chmod -R 777 ./

USER scenesearch

RUN wget \
    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    # && mkdir ./.conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh 

ENV PATH="./miniconda3/bin:${PATH}"
ARG PATH="./miniconda3/bin:${PATH}"

RUN conda create -n scenesearch python=3.7.9

SHELL ["conda", "run", "-n", "scenesearch", "/bin/bash", "-c"]

RUN conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

RUN pip install --upgrade pip
RUN pip install nuscenes-devkit
RUN pip install pygame networkx

RUN pip install --no-cache-dir -r requirements.txt

ENV NVIDIA_DRIVER_CAPABILITIES=all
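
For completeness, this is roughly how I start the container (the image name here is a placeholder; the relevant parts are exposing the GPUs, giving PyTorch enough shared memory for the data loader workers, and passing the NCCL variables through):

docker run --gpus all \
    --shm-size=8g \
    -e NCCL_SOCKET_IFNAME=eth0 \
    -e NCCL_IB_DISABLE=1 \
    -e NCCL_DEBUG=info \
    -e NCCL_P2P_DISABLE=1 \
    -it scenesearch:latest /bin/bash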

Hey @crnyu, the above error suggests this call is using the gloo backend rather than NCCL (the failure is inside dist.new_group(backend="gloo")). Have you tried configuring GLOO_SOCKET_IFNAME instead?
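
For example, mirroring your NCCL setting (assuming eth0 is the interface you want gloo to bind to):

export GLOO_SOCKET_IFNAME=eth0

If GLOO_SOCKET_IFNAME is not set, gloo picks an interface by resolving the hostname, and inside a container that can resolve to an address that isn't bound to any interface; that would explain the mismatch in your logs between 0.31.32.145 (in the enforce failure) and eth0's 100.104.55.225.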

This was correct: setting GLOO_SOCKET_IFNAME fixed it. Thanks @mrshenli, much appreciated.