I am running into a very frustrating issue with Detectron2 in a multi-GPU setup inside Docker. Training works fine outside Docker, but inside the container I get the following error after the COCO data loads and the trainer is called. Is there some NCCL setting that I'm not seeing that I have to set?
Traceback (most recent call last):
File "perception/isaac_kitti.py", line 367, in <module>
args=(args,),
File "/home/scenesearch/src/detectron2/detectron2/engine/launch.py", line 59, in launch
daemon=False,
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/scenesearch/src/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/scenesearch/perception/isaac_kitti.py", line 160, in train
trainer = IsaacKittiTrainer(cfg)
File "/home/scenesearch/src/detectron2/detectron2/engine/defaults.py", line 284, in __init__
data_loader = self.build_train_loader(cfg)
File "/home/scenesearch/src/detectron2/detectron2/engine/defaults.py", line 473, in build_train_loader
return build_detection_train_loader(cfg)
File "/home/scenesearch/src/detectron2/detectron2/config/config.py", line 201, in wrapped
explicit_args = _get_args_from_config(from_config, *args, **kwargs)
File "/home/scenesearch/src/detectron2/detectron2/config/config.py", line 238, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/home/scenesearch/src/detectron2/detectron2/data/build.py", line 327, in _train_loader_from_config
sampler = TrainingSampler(len(dataset))
File "/home/scenesearch/src/detectron2/detectron2/data/samplers/distributed_sampler.py", line 37, in __init__
seed = comm.shared_random_seed()
File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 230, in shared_random_seed
all_ints = all_gather(ints)
File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 154, in all_gather
group = _get_global_gloo_group()
File "/home/scenesearch/src/detectron2/detectron2/utils/comm.py", line 89, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2508, in new_group
timeout=timeout)
File "/home/scenesearch/miniconda3/envs/scenesearch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 592, in _new_process_group_helper
timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1616554800319/work/third_party/gloo/gloo/transport/tcp/device.cc:208] ifa != nullptr. Unable to find interface for: [0.31.32.145]
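One mismatch jumps out: the NCCL log below shows eth0 bound at 100.104.55.225, yet Gloo fails looking for an interface with 0.31.32.145. As far as I understand, Gloo's TCP transport picks its interface by resolving the machine's hostname, so my (unconfirmed) guess is that the container's hostname resolves to an address that isn't bound to any interface. A quick check from inside the container:

```python
import socket

# What Gloo will effectively resolve when picking a TCP interface:
# the container's hostname and the address(es) it maps to
# (often an /etc/hosts entry managed by Docker).
host = socket.gethostname()
print("hostname:", host)
print("resolves to:", socket.gethostbyname_ex(host)[2])
```

If this prints 0.31.32.145 rather than 100.104.55.225, the hostname-to-address mapping inside the container is the culprit, not Detectron2.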
I've tried a lot of NCCL configurations; the current one sets the following:
export NCCL_SOCKET_IFNAME=eth0; export NCCL_IB_DISABLE=1; export NCCL_DEBUG=info; export NCCL_P2P_DISABLE=1
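One thing I noticed: the stack trace actually fails inside `dist.new_group(backend="gloo")`, and to my knowledge the Gloo backend ignores `NCCL_SOCKET_IFNAME` and reads its own `GLOO_SOCKET_IFNAME` variable instead. I haven't confirmed this fixes anything, but the analogous setting would be:

```shell
# Gloo has its own interface override, separate from NCCL's
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
```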
Below is the NCCL_DEBUG output, but I don't see anything in it suggestive of the actual error. There appears to be only one issue about this on the Detectron2 GitHub page, where the maintainers say it is a DDP problem, not a Detectron2 concern. I wonder if this is actually a Docker issue.
2039953:3182:3182 [0] NCCL INFO Bootstrap : Using [0]eth0:100.104.55.225<0>
2039953:3182:3182 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2039953:3182:3182 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
2039953:3182:3182 [0] NCCL INFO NET/Socket : Using [0]eth0:100.104.55.225<0>
2039953:3182:3182 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
2039953:3183:3183 [1] NCCL INFO Bootstrap : Using [0]eth0:100.104.55.225<0>
2039953:3183:3183 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2039953:3183:3183 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
2039953:3183:3183 [1] NCCL INFO NET/Socket : Using [0]eth0:100.104.55.225<0>
2039953:3183:3183 [1] NCCL INFO Using network Socket
2039953:3183:3351 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2039953:3182:3350 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2039953:3182:3350 [0] NCCL INFO Channel 00/02 : 0 1
2039953:3182:3350 [0] NCCL INFO Channel 01/02 : 0 1
2039953:3183:3351 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
2039953:3183:3351 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
2039953:3182:3350 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
2039953:3182:3350 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
2039953:3182:3350 [0] NCCL INFO Channel 00 : 0[60] -> 1[70] via direct shared memory
2039953:3183:3351 [1] NCCL INFO Channel 00 : 1[70] -> 0[60] via direct shared memory
2039953:3182:3350 [0] NCCL INFO Channel 01 : 0[60] -> 1[70] via direct shared memory
2039953:3183:3351 [1] NCCL INFO Channel 01 : 1[70] -> 0[60] via direct shared memory
2039953:3182:3350 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
2039953:3182:3350 [0] NCCL INFO comm 0x7f419c002dd0 rank 0 nranks 2 cudaDev 0 busId 60 - Init COMPLETE
2039953:3183:3351 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
2039953:3182:3182 [0] NCCL INFO Launch mode Parallel
2039953:3183:3351 [1] NCCL INFO comm 0x7f818c002dd0 rank 1 nranks 2 cudaDev 1 busId 70 - Init COMPLETE
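To isolate whether this is Detectron2 or the container, I think the same `dist.new_group(backend="gloo")` code path can be exercised with a minimal two-process Gloo test that needs no GPUs. This is just a sketch; the rendezvous address and port are arbitrary choices of mine:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Rendezvous over loopback; 29500 is an arbitrary free port.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # The same call that fails inside Detectron2's _get_global_gloo_group()
    group = dist.new_group(backend="gloo")
    # Mirror comm.all_gather(): each rank contributes one value
    x = torch.tensor([rank])
    out = [torch.zeros_like(x) for _ in range(world_size)]
    dist.all_gather(out, x, group=group)
    print(f"rank {rank}: gathered {[int(t) for t in out]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If this also dies with `ifa != nullptr` inside the container but runs fine on the host, that would point at the container's network/hostname setup rather than anything in Detectron2.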
Here is my Dockerfile:
FROM nvcr.io/nvidia/pytorch:20.11-py3
USER root
RUN useradd -ms /bin/bash scenesearch
RUN apt-get update
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=America/New_York
RUN apt-get install libgl1-mesa-glx -y
RUN apt-get install ffmpeg libsm6 libxext6 -y
RUN apt-get install -y software-properties-common &&\
apt-add-repository universe &&\
apt-get update &&\
apt-get install -y python3-pip
RUN apt-get install -y libpng16-16 libtiff5 libjpeg-turbo8 wget && rm -rf /var/lib/apt/lists/*
WORKDIR /home/scenesearch
COPY . /home/scenesearch
RUN chmod -R 777 ./
USER scenesearch
RUN wget \
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
# && mkdir ./.conda \
&& bash Miniconda3-latest-Linux-x86_64.sh -b \
&& rm -f Miniconda3-latest-Linux-x86_64.sh
ENV PATH="./miniconda3/bin:${PATH}"
ARG PATH="./miniconda3/bin:${PATH}"
RUN conda create -n scenesearch python=3.7.9
SHELL ["conda", "run", "-n", "scenesearch", "/bin/bash", "-c"]
RUN conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
RUN pip install --upgrade pip
RUN pip install nuscenes-devkit
RUN pip install pygame networkx
RUN pip install --no-cache-dir -r requirements.txt
ENV NVIDIA_DRIVER_CAPABILITIES=all
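For what it's worth, a sanity check I plan to run inside the built container is to compare the name-to-address mappings Docker writes with the interfaces it actually configures (assuming `iproute2` is present in the image):

```shell
# Inside the running container: the name→address mappings Docker manages...
cat /etc/hosts
# ...versus the addresses actually bound to interfaces.
ip -o -4 addr show
```

If /etc/hosts maps the container hostname to 0.31.32.145 while the interfaces only carry 100.104.55.225, that mismatch would line up with the Gloo failure above.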