CUDA Error in Docker Container

test_cuda.py:

import torch

def check_cuda():
    print("Is CUDA available in PyTorch:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Number of CUDA devices:", torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            print("CUDA Device #{}: {}".format(i, torch.cuda.get_device_name(i)))

if __name__ == "__main__":
    check_cuda()
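
In case it helps, torch.version.cuda can be printed as well; it reports the CUDA toolkit the wheel was built against and is populated even when runtime initialization fails. A small optional addition to the script:

# Optional diagnostics: the build-time CUDA version is available even
# when cudaGetDeviceCount() fails at runtime.
print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)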

Dockerfile:

FROM python:3.8

RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

COPY test_cuda.py /test_cuda.py

CMD ["python3", "/test_cuda.py"]
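
For completeness, I build and run the image along these lines (the tag torch-cuda-test is arbitrary):

sudo docker build -t torch-cuda-test .
sudo docker run --rm --gpus all torch-cuda-test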

Machine: Azure Standard NCC24ads A100 v4 (24 vCPUs, 220 GiB memory)
OS: Ubuntu 20.04
nvidia-smi:

Mon Jan 22 20:30:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.10.12    Driver Version: 470.10.12    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0    42W / 300W |     10MiB / 81251MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If I start a Python shell on the host and run torch.cuda.is_available(), I receive True.
If I run sudo docker run --rm --gpus all ubuntu nvidia-smi, I get the expected output.

When I run a TensorFlow image, it detects and uses the GPU.
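
Something along these lines is enough to check that (the image tag is illustrative; any recent GPU tag should behave the same):

sudo docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"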

But when I run my test Dockerfile with torch, I receive this error:

/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization:
Unexpected error from cudaGetDeviceCount().
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 801: operation not supported (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

I have tried changing driver versions and installing Fabric Manager and DCGM, all to no avail. The failure seems specific to running torch inside a container.

I think this is different from other reported issues, which involve different error codes, such as 803 or 804.

Any help, including general information on running PyTorch in a container, would be appreciated.

Thanks!

Are you able to run any other PyTorch Docker container from Docker Hub or NGC?
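
For example, something along these lines (the NGC tag here is only an illustration; any recent one should do):

sudo docker run --rm --gpus all nvcr.io/nvidia/pytorch:23.10-py3 \
    python -c "import torch; print(torch.cuda.is_available())"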

Running the example container gives me the same error:

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ Operation not supported (error 801) ]]