Possible bug: `torch.cuda.is_available()` returns `False` with certain torch, CUDA, and driver versions

Hi, I’m trying to create a Docker container from the following CUDA 12.4.1 Dockerfile (host info: Driver Version 550.107.02, CUDA Version 12.4; detailed versions can be found in: BUG: `torch.cuda.is_available()` returns `False` in certain torch, CUDA and driver version · Issue #135508 · pytorch/pytorch · GitHub):

FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

ARG DEBIAN_FRONTEND=noninteractive

# Install common tool & conda
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt install -y python3.10 \
    && rm -rf /var/lib/apt/lists/*

RUN apt update && \
    apt install wget -y && \
    apt install git -y && \
    apt install curl -y && \
    apt install vim -y && \
    apt install bc -y && \
    apt-get install net-tools -y && \
    apt install ssh -y && \
    wget --quiet https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    mkdir -p /opt/conda/envs/finetune && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc


# Workspace
WORKDIR /app

# Install conda finetune env
# COPY requirements.txt requirements.txt
RUN . /opt/conda/etc/profile.d/conda.sh && \
    conda create --name finetune python=3.10 -y && \
    conda activate finetune && \
    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10

# Cuda path
ENV CUDA_PATH=/usr/local/cuda
ENV LD_LIBRARY_PATH=$CUDA_PATH/lib64:$CUDA_PATH/compat:/usr/lib/x86_64-linux-gnu:$CUDA_PATH/targets/x86_64-linux/lib/stubs/:$LD_LIBRARY_PATH
ENV CUDNN_PATH=/usr/include
# Transformer engine path
ENV NVTE_FRAMEWORK=pytorch

# Copy workspace
COPY . .

# Entrypoint for bash shell
ENTRYPOINT ["/bin/bash"]

This just creates a basic nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 image and installs conda and pip. Then I run the container with the following command:

docker run --runtime=nvidia -it --rm --gpus all --shm-size 64g --network=host --privileged --volume [USER_PATH]/.cache:/root/.cache --env NVIDIA_DISABLE_REQUIRE=1 username/imagename:tag

Then, inside the container, I install the latest stable torch (2.4.1) with:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

After that, I run the simplest torch CUDA test:

python -c "import torch; print(torch.cuda.is_available())"

What I get is:

/opt/conda/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

This is quite strange, since if I simply switch to the base image nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04, I get the correct result: torch.cuda.is_available() returns True.
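Error 803 means the user-space CUDA libraries loaded inside the container require a driver the host does not provide (or the wrong `libcuda.so` was loaded). As a first sanity check, the host driver can be compared against the toolkit's minimum required Linux driver. A minimal sketch, assuming the 550.54.15 minimum for CUDA 12.4 Update 1 taken from NVIDIA's release notes (re-verify it for your exact toolkit build):

```python
# Sketch: compare an NVIDIA driver version string against the minimum
# Linux driver a CUDA toolkit requires. The 550.54.15 value for
# CUDA 12.4.1 is an assumption from NVIDIA's release notes.

def parse_driver(version: str) -> tuple:
    """Turn a version string like '550.107.02' into a comparable int tuple."""
    return tuple(int(part) for part in version.split("."))

def driver_supports(driver: str, minimum: str) -> bool:
    """True if the installed driver meets the toolkit's minimum."""
    return parse_driver(driver) >= parse_driver(minimum)

CUDA_12_4_1_MIN_LINUX_DRIVER = "550.54.15"  # assumption, see lead-in

host_driver = "550.107.02"  # from nvidia-smi on the host in this report
print(driver_supports(host_driver, CUDA_12_4_1_MIN_LINUX_DRIVER))  # → True
```

Since the host driver here satisfies the minimum, a plain version mismatch looks unlikely; note also that the Dockerfile above puts `$CUDA_PATH/targets/x86_64-linux/lib/stubs/` on `LD_LIBRARY_PATH`, and if the stub `libcuda.so` there shadows the real driver library injected by the NVIDIA runtime, that can produce exactly this error 803, so it is worth checking independently.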

Any advice would be sincerely appreciated, thanks!

The corresponding issue on GitHub has a useful suggestion: compile a small standalone CUDA example to verify that the container itself works in your environment.
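A minimal sketch of such a check (the filename and build line are illustrative, not from the issue; it assumes the `nvcc` shipped in the `devel` base image is on `PATH`):

```cuda
// check_cuda.cu -- minimal standalone CUDA runtime check.
// Build (inside the container): nvcc check_cuda.cu -o check_cuda
// Run:                          ./check_cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // If the container (not PyTorch) is at fault, this is expected
        // to report the same error 803 ("system has unsupported display
        // driver / cuda driver combination").
        std::printf("cudaGetDeviceCount failed: %s (error %d)\n",
                    cudaGetErrorString(err), static_cast<int>(err));
        return 1;
    }
    std::printf("found %d CUDA device(s)\n", count);
    return 0;
}
```

If this program also fails with error 803, the problem lies in the container/driver setup rather than in the torch wheel; if it succeeds while `torch.cuda.is_available()` still returns `False`, the mismatch is specific to the CUDA libraries PyTorch loads.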