# of available CUDA devices less than total # of GPUs?

I have run into a very strange situation. I was training a model on an 8xA100-40GB SXM node when the training process started hanging for unknown reasons. I killed the training process and decided to restart it.

However, the number of available GPUs has gone down from 8 to 5. Specifically, I ran the following script:

import torch
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
from subprocess import call
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())

print('Available devices ', torch.cuda.device_count())
print('Current cuda device ', torch.cuda.current_device())

and I got the following output:

__Python VERSION: 3.8.10 (default, Jun 22 2022, 20:18:18) 
[GCC 9.4.0]
__pyTorch VERSION: 1.11.0
__CUDA VERSION
__CUDNN VERSION: 8303
__Number CUDA Devices: 5
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 2 MiB, 40351 MiB
1, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 0 MiB, 40354 MiB
2, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 2 MiB, 40351 MiB
3, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 2 MiB, 40351 MiB
4, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 0 MiB, 40354 MiB
5, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 2 MiB, 40351 MiB
6, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 0 MiB, 40354 MiB
7, NVIDIA A100-SXM4-40GB, 510.47.03, 40960 MiB, 2 MiB, 40351 MiB
Active CUDA Device: GPU 0
Available devices  5
Current cuda device  0

I don’t have CUDA_VISIBLE_DEVICES set. Also, I ran htop and there are no Python processes running, and nvidia-smi shows 0% utilization, so I don’t think the issue is that some other process is using the GPUs.
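
In command form, the checks I did were roughly the following (reconstructed from memory, so the exact flags may differ slightly):

# is CUDA_VISIBLE_DEVICES (or any other CUDA_* restriction) set in this shell?
env | grep -i cuda
# any processes currently holding the GPUs?
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# per-GPU utilization and memory use
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv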

Does anyone know what might be going on?

In the past this variable was sometimes set without the user being aware of it, so you could check whether it is set, unset it, and retry the code. Also, try to run any multi-GPU CUDA application (e.g. nccl-tests) to check whether all devices can be used or not.
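
For example, something along these lines would work (a rough sketch assuming the default nccl-tests build layout; adjust -g to the number of GPUs you expect to see):

# make sure no device mask is applied in the current shell
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
unset CUDA_VISIBLE_DEVICES

# build and run nccl-tests across all 8 GPUs
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8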

@ptrblck nccl-tests fails to run; it complains about an invalid device ordinal. I guess the issue is not in PyTorch.

Thanks for checking it, as it’s a good indication of a PyTorch-unrelated issue.
Was this setup working before? If so, do you see any Xid errors in dmesg -T?

@ptrblck Thanks for the suggestion!

I’ve run into this error on 2 different machines. The first time, my training process randomly stalled, and when I shut it down (stopped the Docker container), 5/8 GPUs were usable. The second time, I shut down my training process before it stalled and now 7/8 GPUs are usable.

dmesg -T | grep Xid

returns:

[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:09 2022] NVRM: Xid (PCI:0000:0d:00): 119, pid=87763, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:17 2022] NVRM: Xid (PCI:0000:07:00): 119, pid=87502, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:22 2022] NVRM: Xid (PCI:0000:06:00): 119, pid=87501, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:31 2022] NVRM: Xid (PCI:0000:08:00): 119, pid=87503, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:36 2022] NVRM: Xid (PCI:0000:09:00): 119, pid=87543, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:42 2022] NVRM: Xid (PCI:0000:0c:00): 119, pid=87725, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:48 2022] NVRM: Xid (PCI:0000:0a:00): 119, pid=87575, Timeout waiting for RPC from GSP! Expected function 76.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 10.
[Fri Sep 23 06:18:54 2022] NVRM: Xid (PCI:0000:0b:00): 119, pid=87587, Timeout waiting for RPC from GSP! Expected function 76.

Thanks for the update and the output. Are you seeing Xid 119 on both nodes?
Could you post more setup information, i.e. the output of python -m torch.utils.collect_env and the output of nvidia-smi? Also, which Docker container were you using?
Would it also be possible to get an nvidia-bug-report? You should be able to execute sudo nvidia-bug-report.sh on your node and could upload the data somewhere or send it directly to me (send me a private message in case you need my email address).
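
I.e. something like this (nvidia-bug-report.sh writes nvidia-bug-report.log.gz to the current directory):

python -m torch.utils.collect_env
nvidia-smi
sudo nvidia-bug-report.sh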

Alas, I just shut down the machine because it costs money to run. But, on my previous machine (the one with 5/8 GPUs usable), I did run sudo nvidia-bug-report.sh, so here’s the output of that: https://drive.google.com/file/d/1KEm6ud3h5tNGMQ8i5y5A3myO9iN6Usxo/view?usp=sharing.

Here’s the Dockerfile:

FROM nvidia/cuda:11.6.1-devel-ubuntu20.04

WORKDIR /app

# Prevents apt from giving prompts
# Set as ARG so it does not persist after build
# https://serverfault.com/questions/618994/when-building-from-dockerfile-debian-ubuntu-package-install-debconf-noninteract
ARG DEBIAN_FRONTEND=noninteractive


ARG ENV=my-env
# or conda run --no-capture-output -n ${ENV}
# You need to escape spaces
ARG RUN=micromamba\ run\ -n\ ${ENV}
ARG PILLOW_PSEUDOVERSION=7.0.0
ARG PILLOW_SIMD_VERSION=7.0.0.post3

# Docker docs: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
# In addition, when you clean up the apt cache by removing /var/lib/apt/lists it reduces the image size, since the apt cache is not stored in a layer.
# Since the RUN statement starts with apt-get update, the package cache is always refreshed prior to apt-get install.
RUN apt update && apt install \
    # for installing miniconda
    curl \
    -y && rm -rf /var/lib/apt/lists/*


# build-essential yasm cmake libtool libc6 libc6-dev unzip wget libnuma1 libnuma-dev pkg-config \

# Install miniconda.sh
# ENV PATH="/root/miniconda3/bin:${PATH}"
# COPY ./environment.yml ./environment.yml
# RUN curl \
# 	https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh -o miniconda.sh \
#         && mkdir /root/.conda \
#         && bash miniconda.sh -b \
#         && rm miniconda.sh

# Install micromamba.sh
ENV PATH="/root/bin/:${PATH}"
RUN curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba \
	&& mkdir /root/bin \
	&& mv bin/micromamba /root/bin \
	&& rmdir bin

RUN apt update && apt install \
    # for building torchvision
    git ninja-build \
    -y && rm -rf /var/lib/apt/lists/*

COPY ./environment.yml ./environment.yml
RUN micromamba create -f environment.yml
COPY ./requirements_torchvision.txt ./requirements_torchvision.txt
# pip throws a warning "don't run pip as root", but it's running inside the venv -- so ignore it
# (you can check by attaching to the container & running `pip list` on the base env)
RUN --mount=type=cache,target=/root/.cache ${RUN} pip install -r requirements_torchvision.txt --extra-index-url https://download.pytorch.org/whl/cu116

# Install pillow-simd w/ libjpeg-turbo
COPY ./docker-utils/pillow_stub /tmp/pillow_stub
RUN --mount=type=cache,target=/root/.cache \
    # Uninstall existing pillow / libjpeg
    # (nothing to uninstall in conda env)
    # micromamba remove -n ${ENV} --force pillow pil jpeg libtiff libjpeg-turbo \
    ${RUN} pip uninstall -y pillow pil jpeg libtiff libjpeg-turbo \
    && micromamba install -n ${ENV} -y -c conda-forge libjpeg-turbo \
    # trick Python into thinking pillow is already installed, this will prevent future packages from actually installing pillow
    && ${RUN} pip install --no-cache-dir --upgrade /tmp/pillow_stub \
    && env CFLAGS="-mavx2" ${RUN} pip install --upgrade --no-cache-dir --force-reinstall --no-binary :all: --compile pillow-simd==${PILLOW_SIMD_VERSION}

# torchvision (FFMPEG dependency is installed through conda)
RUN git clone --depth 1 --branch v0.13.1 https://github.com/pytorch/vision.git vision \
    && cd vision \
    # remove the existing torchvision
    # && conda run --no-capture-output -n video-rec pip uninstall --yes torchvision \
    && ${RUN} python3 setup.py install

# We have CUDA 11, so this will also compile for A100s
RUN git clone \
   https://github.com/HazyResearch/flash-attention.git \
   && cd flash-attention \
   && git checkout 8166063a556e17e03e4a0697ba604def1eeb6a99 \
   && ${RUN} python setup.py install

# install the rest of the requirements
COPY ./requirements.txt ./requirements.txt
RUN --mount=type=cache,target=/root/.cache ${RUN} pip install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache ${RUN} pip install -U scalene==1.5.11

# TODO: FFMPEG for GPU video-decoding
# TODO: Pillow SIMD for fast image augmentations

# Uncomment + remove FFMPEG from environment.yml if using GPU decoding
# Compile nv-codec
# RUN git clone --depth 1 --branch n11.1.5.1 https://git.videolan.org/git/ffmpeg/nv-codec-headers.git && \
#     cd nv-codec-headers && \
#     make install -j 100

# Build FFMPEG with Nvidia, torch requires FFMPEG 4.2 (I think)
# RUN git clone --depth 1 --branch n4.2.7 https://git.ffmpeg.org/ffmpeg.git ffmpeg/ \
#      && cd ffmpeg && \
#     ./configure --enable-nonfree --enable-shared --enable-cuda-nvcc --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 \
#     # I need this, I believe this generates code that works for A100s
#     # See: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
#     # SM80 or SM_80, compute_80  => NVIDIA A100
#     # SM76 => Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU
#     --nvccflags="-gencode arch=compute_80,code=sm_80 -O2" \
#     && make -j 100  \
#     && make install
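
For completeness, I build and run the image roughly like this (the image tag, mount path, and entry script below are placeholders, not the exact commands I use):

docker build -t training-image .
docker run --rm -it --gpus all --shm-size=16g \
    -v /path/to/data:/app/data \
    training-image \
    micromamba run -n my-env python train.py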

@ptrblck Is there a chance this is an issue in my code, or do you think this is most likely an issue in the NVIDIA drivers?

Too early to tell what the root cause is, as I need to take a proper look at your Dockerfile, the bug report, and potentially the code you are running, but in any case your GPUs shouldn’t just drop even if your code is completely broken.

To clarify, I am running all my code inside a Docker container, and this GPU issue persists even after my Docker container is spun down (i.e., I am running nccl-tests, etc. directly on the host machine).

Also, this error is happening on a Lambda Labs 8xA100-40GB SXM machine. Not sure how helpful that is, but figured I’d throw it in there.

Yes, that’s helpful. Would it be possible to get a code snippet that reproduces the issue and the expected failure rate (i.e. how long the code needs to run before you see the issue)?

Also, the output of nvidia-smi would still be helpful, as I need to check the installed driver version.