test_cuda.py:
import torch


def check_cuda():
    print("Is CUDA available in PyTorch:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Number of CUDA devices:", torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            print("CUDA Device #{}: {}".format(i, torch.cuda.get_device_name(i)))


if __name__ == "__main__":
    check_cuda()
Dockerfile:
FROM python:3.8
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
COPY test_cuda.py /test_cuda.py
CMD ["python3", "/test_cuda.py"]
Machine: Azure Standard NCC24ads A100 v4 (24 vCPUs, 220 GiB memory)
OS: Ubuntu 20.04
nvidia-smi:
Mon Jan 22 20:30:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.10.12    Driver Version: 470.10.12    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0    42W / 300W |     10MiB / 81251MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If I start a Python shell and run torch.cuda.is_available(), I receive True.
If I run "sudo docker run --rm --gpus all ubuntu nvidia-smi", I get the expected output.
When I run a TensorFlow image, it detects and uses the GPU.
But when I run my test Dockerfile with torch, I receive this error:
/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization:
Unexpected error from cudaGetDeviceCount().
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 801: operation not supported (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
I have tried changing driver versions and installing Fabric Manager and DCGM, all to no avail. The problem seems to occur specifically when running torch in a container.
I think this is different from other reported issues, which have different error codes, such as 803 or 804.
Any help, including general info on running PyTorch in a container, would be appreciated.
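In case it helps narrow this down, the raw driver result could also be checked inside the container without going through PyTorch. The following is only a minimal sketch, assuming the driver library is exposed in the container as libcuda.so.1; probe_cuda_driver is a hypothetical helper name, not part of any library. It calls cuInit(0) from the CUDA driver API through ctypes and reports the result code, where 801 corresponds to CUDA_ERROR_NOT_SUPPORTED.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly and return (code, name) for the result.

    Returns (None, message) if the driver library cannot be loaded.
    The default library name is an assumption about the container setup.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, "could not load {}: {}".format(lib_name, exc)

    # cuInit returns a CUresult: 0 is CUDA_SUCCESS, 801 is CUDA_ERROR_NOT_SUPPORTED.
    code = libcuda.cuInit(0)

    # cuGetErrorName stores a pointer to a static string naming the result code.
    name = ctypes.c_char_p()
    if libcuda.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return code, name.value.decode()
    return code, "unknown"


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this also reports 801, the failure would be at the driver level rather than anything specific to torch.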
Thanks!