torch.cuda.is_available() never returns

After I restarted the Docker container running my torch-based application, torch.cuda.is_available() hangs and never returns, so I can't start the application again. It hangs not only inside the container, but also outside it in different environments.

CUDA version: 12.0. OS: Ubuntu 20.04. Container image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel

The last error in the logs before this started happening:
Traceback (most recent call last):
File "/code/app.py", line 2, in
from few_pixels_as_function import few_pixels_attack
File "/code/few_pixels_as_function.py", line 24, in
seed = torch.tensor(seed).to(device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

dmesg --level=err returns nothing; neither does the warn level.
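As a workaround while debugging, one way to keep a hang like this from freezing the whole application is to probe CUDA in a child process with a timeout. A minimal sketch (the helper name `cuda_available` and the 10-second timeout are my own choices; the probe just imports torch and calls torch.cuda.is_available(), and any failure or timeout is treated as "unavailable"):

```python
import subprocess
import sys

def cuda_available(timeout: float = 10.0) -> bool:
    """Probe CUDA in a child process so a driver hang cannot freeze the caller.

    Returns False if torch is missing, CUDA initialization fails, or the
    probe does not finish within `timeout` seconds.
    """
    probe = "import torch; print(torch.cuda.is_available())"
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # An indefinite hang in torch.cuda.is_available() lands here
        # instead of blocking application startup.
        return False
    return result.returncode == 0 and result.stdout.strip() == "True"
```

This only sidesteps the hang in the parent process; it does not fix the underlying driver problem.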

Any help is much appreciated.

Update: I checked what torch.cuda.memory_reserved() and torch.cuda.memory_allocated() return now. Both are zero, even though nvidia-smi shows that isn't true. torch.cuda.init() also hangs indefinitely. Is it something with the drivers?
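When nvidia-smi and PyTorch's counters disagree like this, it can help to see which processes are actually holding memory on the device, e.g. a stale process left over from the old container. A small sketch around `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader` (those query options are standard, but the exact output format can vary with driver version, so the parser is kept deliberately loose):

```python
import subprocess

def parse_compute_apps(csv_text: str) -> list[tuple[int, str]]:
    """Parse 'pid, used_memory' CSV lines from nvidia-smi into (pid, memory) pairs."""
    apps = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid, mem = (field.strip() for field in line.split(",", 1))
        apps.append((int(pid), mem))
    return apps

def gpu_compute_apps() -> list[tuple[int, str]]:
    """Query nvidia-smi for compute processes currently holding GPU memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

If a PID shows up here that no longer corresponds to a live application, killing it (or rebooting) is often what frees the device.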

The errors sound like a driver issue; I would recommend reinstalling the drivers and updating PyTorch to the latest version if possible.
