Hi, I’m trying to create a Docker container with the following CUDA 12.4.1 Dockerfile (host info: Driver Version: 550.107.02, CUDA Version: 12.4; detailed versions can be found in BUG: `torch.cuda.is_available()` returns `False` in certain torch, CUDA and driver version · Issue #135508 · pytorch/pytorch · GitHub):
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive
# Install common tools & conda
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt install -y python3.10 \
    && rm -rf /var/lib/apt/lists/*
RUN apt update && \
    apt install wget -y && \
    apt install git -y && \
    apt install curl -y && \
    apt install vim -y && \
    apt install bc -y && \
    apt-get install net-tools -y && \
    apt install ssh -y && \
    wget --quiet https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    mkdir -p /opt/conda/envs/finetune && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc
# Workspace
WORKDIR /app
# Install conda finetune env
# COPY requirements.txt requirements.txt
RUN . /opt/conda/etc/profile.d/conda.sh && \
    conda create --name finetune python=3.10 -y && \
    conda activate finetune && \
    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
# CUDA path
ENV CUDA_PATH=/usr/local/cuda
ENV LD_LIBRARY_PATH=$CUDA_PATH/lib64:$CUDA_PATH/compat:/usr/lib/x86_64-linux-gnu:$CUDA_PATH/targets/x86_64-linux/lib/stubs/:$LD_LIBRARY_PATH
ENV CUDNN_PATH=/usr/include
# Transformer engine path
ENV NVTE_FRAMEWORK=pytorch
# Copy workspace
COPY . .
# Entrypoint for bash shell
ENTRYPOINT ["/bin/bash"]
This just creates a basic nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 image and installs conda and pip.
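For completeness, I build it with a plain docker build, something like this (username/imagename:tag is just the placeholder tag used in the run command below):
docker build -t username/imagename:tag .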
Then I run the container with the following command:
docker run --runtime=nvidia -it --rm --gpus all --shm-size 64g --network=host --privileged --volume [USER_PATH]/.cache:/root/.cache --env NVIDIA_DISABLE_REQUIRE=1 username/imagename:tag
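As a sanity check at this point (not part of my original steps, just added in case it helps), nvidia-smi inside the container should report the host driver 550.107.02 / CUDA 12.4, and the NVIDIA container toolkit usually mounts the driver's user-space library into /usr/lib/x86_64-linux-gnu, which can be checked with:
nvidia-smi
ls -l /usr/lib/x86_64-linux-gnu/libcuda*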
Then, inside the container, I install the latest stable torch (2.4.1) with:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
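(To rule out a wheel mix-up, the build that actually got installed can be confirmed with the one-liner below; for the cu124 wheel, torch.version.cuda should print 12.4.)
python -c "import torch; print(torch.__version__, torch.version.cuda)"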
After that, I run the simplest torch CUDA test:
python -c "import torch; print(torch.cuda.is_available())"
What I got is:
/opt/conda/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
This is quite strange, since if I simply switch to the base image nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04 (keeping everything else in the Dockerfile and the steps above the same), I get the correct result: torch.cuda.is_available() returns True.
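In case it helps with diagnosis, I can also post the output of checks like the following from inside the failing container (just a guess at what might be relevant, e.g. which libcuda.so.1 actually gets picked up, given the compat and stubs directories on LD_LIBRARY_PATH in the Dockerfile above):
echo $LD_LIBRARY_PATH
ldconfig -p | grep libcuda
python -c "import ctypes; ctypes.CDLL('libcuda.so.1'); print('libcuda.so.1 loaded OK')"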
Any advice will be sincerely appreciated, thanks!