Bizarre docker inference run error, requesting feedback

Traceback and Dockerfile are below. I cannot figure out what could possibly be causing this, and I’m not sure whether to report the issue to moby (Docker) or PyTorch:

  • previous docker build ran just fine
  • 3 models are used in the pipeline (human pose estimation)
  • I made minor trivial edits to some code unrelated to the models themselves
  • rebuilt, then suddenly the last model (D3DP) fails to run with the error below
  • it also happens after a reboot
  • the other models get onto the GPU and run just fine
  • the D3DP model was wrapped in nn.DataParallel, but the error occurs even when I remove that.

I cannot figure this out: nothing has changed, and I have no idea why the last model, D3DP, suddenly doesn’t have the ‘device’ attribute. Unfortunately I didn’t have the presence of mind to save the docker build output, but most of the cached docker image layers were reused, so I don’t see how any dependencies could have changed.

Traceback (most recent call last):
  File "pose_inference_3D.py", line 302, in <module>
    pose_inference_3D_main(args)
  File "pose_inference_3D.py", line 164, in pose_inference_3D_main
    bs=args.batch_size) # b, t, h, 243, j, c
  File "/video-to-pose3D/common/utils.py", line 315, in evaluate_diffusion
    input_2d_flip=inputs_2d_flip_single)  # b, t, h, f, j, c
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/video-to-pose3D/common/diffusionpose.py", line 276, in forward
    results = self.ddim_sample(input_2d, input_3d)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/video-to-pose3D/common/diffusionpose.py", line 182, in ddim_sample
    img = torch.randn(shape, device=self.device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in __getattr__
    type(self).__name__, name))
AttributeError: 'D3DP' object has no attribute 'device'

dockerfile:

# Version 1 of the Dockerfile for 
ARG PYTORCH="1.9.0"
ARG CUDA="11.1"
ARG CUDNN="8"
# try to use the runtime image instead, half size
# FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-runtime
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"


RUN apt-get update && apt-get install -y \
    gnupg2 \
    gcc \
    git \
    ninja-build \
    libglib2.0-0 \
    libsm6 \
    libxrender-dev \
    libxext6 \
    libgl1-mesa-glx \
    vim \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# To fix GPG key error when running apt-get update
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub

# Install xtcocotools
RUN pip install cython
RUN pip install xtcocotools

# Install MMCV and MMDET
RUN pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
RUN pip install mmdet==2.28.2
RUN conda clean --all

# Install ViTPose
WORKDIR /video-to-pose3D/joints_detectors/ViTPose
# RUN git clone https://github.com/ViTAE-Transformer/ViTPose.git /video-to-pose3D/joints_detectors/ViTPose
COPY ./ViTPose /video-to-pose3D/joints_detectors/ViTPose/
RUN pip install --no-cache-dir -e .
RUN pip install timm==0.4.9 einops

# Install video-to-pose3D with D3DP
WORKDIR /video-to-pose3D
COPY ./video-to-pose3D /video-to-pose3D/
RUN pip install -r requirements.txt

ENTRYPOINT [ "python3" ]
CMD [ "pose_inference_3D.py" ]
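
One thing worth ruling out, given the layer order above: when the contents of ./video-to-pose3D change, Docker rebuilds the COPY layer and every layer after it, so `pip install -r requirements.txt` runs again even for a trivial code edit. If any requirement is unpinned, a newer version could have been installed. A minimal sketch for comparing the `pip freeze` output of the old and new images follows; the function names are made up for illustration, and it assumes an older image (or a saved freeze listing) is still available.

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` output.

    Only `name==version` lines are parsed; comments, editable installs
    and URL-based requirements are skipped.
    """
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_freeze(old_text, new_text):
    """Return {name: (old_version, new_version)} for every package that differs.

    A version of None means the package is absent from that listing.
    """
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


if __name__ == "__main__":
    # Illustrative listings only, not taken from the real images.
    old_listing = "pkg-a==1.0\npkg-b==2.0\n"
    new_listing = "pkg-a==1.0\npkg-b==2.1\npkg-c==0.3\n"
    for name, (before, after) in diff_freeze(old_listing, new_listing).items():
        print(name, before, "->", after)
```

The two listings can be captured with something like `docker run --rm --entrypoint pip <image> freeze > old.txt`, where `<image>` is a placeholder for each image tag. An empty diff would point back at the code edits rather than the dependencies.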

Honestly, it looks like the source repo you were using has changed, because the error is very clear: D3DP is not an nn.Module but a plain object. Maybe they abstracted something and reformatted the code. At the very least, go to where the class is declared and check that.

I didn’t pull anything from any of the repos though, so no other source code has changed. That’s what’s getting me.

I’ll try running a REPL in the container and see what I can find.
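
For the REPL session, a small helper along these lines might save some time. It is only a sketch using the standard `inspect` module; `describe_attribute` is a hypothetical name, and it works on any Python object, so it can be pointed at the constructed D3DP model inside the container.

```python
import inspect


def describe_attribute(obj, name):
    """Report where an object's class comes from and where `name` is defined."""
    cls = type(obj)
    try:
        # Shows which file the running class was actually loaded from.
        source_file = inspect.getsourcefile(cls)
    except (TypeError, OSError):
        source_file = None  # built-in class or source not available
    # Classes in the MRO that define `name` (methods, properties, class attrs).
    defined_on = [klass.__name__ for klass in cls.__mro__ if name in vars(klass)]
    return {
        "class": cls.__name__,
        "source_file": source_file,
        # True if `name` was set on the instance, e.g. in __init__.
        "in_instance_dict": name in getattr(obj, "__dict__", {}),
        "defined_on_classes": defined_on,
    }
```

Calling `describe_attribute(model, "device")`, with `model` standing in for the D3DP instance, shows whether the class was loaded from /video-to-pose3D/common/diffusionpose.py as expected, and whether `device` is set on the instance or defined as a property on the class. If it is neither, the assignment in `__init__` was probably removed or skipped.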

Check if you are overriding some module attributes or methods causing these issues by reverting these minor and trivial edits.
If you get stuck, please post a minimal and executable code snippet reproducing the issue, so that we can help debug it.
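
As an illustration of how an edit can produce exactly this message, here is a minimal sketch that needs no PyTorch. `ModuleLike` is a stand-in that only mimics the fallback `__getattr__` of nn.Module (the frame at module.py line 1131 in the traceback); `Broken` and `betas` are hypothetical names, not taken from the D3DP code. When a property raises AttributeError internally, Python falls back to `__getattr__`, which then reports the property itself as missing.

```python
class ModuleLike:
    """Stand-in for torch.nn.Module: __getattr__ runs only after normal lookup fails."""

    def __getattr__(self, name):
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )


class Broken(ModuleLike):
    @property
    def device(self):
        # Bug: `betas` was never set, so this raises AttributeError
        # inside the property body.
        return self.betas.device


def get_error(obj, name):
    """Return the AttributeError message for obj.<name>, or None if lookup succeeds."""
    try:
        getattr(obj, name)
    except AttributeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The message names `device`, although the real failure is the missing `betas`.
    print(get_error(Broken(), "device"))
    # prints: 'Broken' object has no attribute 'device'
```

So if `device` is a property in the model class, the real failure may be one level deeper. If it is a plain attribute set in `__init__`, check that the assignment still runs. A common alternative that does not rely on a stored attribute is `next(model.parameters()).device`.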