Hi all,
We have been running a lot of VMs with workloads containerized in Docker, using the base image:
pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
and the script:
import subprocess
import torch

if __name__ == '__main__':
    print('Testing NVIDIA SMI')
    try:
        # Run the nvidia-smi command
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True, check=True)
        # Print the output
        print("Result", result)
        print("Result stdout: ", result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"subprocess.CalledProcessError occurred: {e}")

    print('Testing CUDA')
    try:
        print("CUDA CHECK ---------------------")
        print('CUDA available "torch.cuda.is_available()":', torch.cuda.is_available())
        print('CUDA device count "torch.cuda.device_count()":', torch.cuda.device_count())
        print('CUDA device name:', torch.cuda.get_device_name(0))
        if torch.cuda.is_available():
            # Print name, compute capability and memory for every visible device
            for idx in range(torch.cuda.device_count()):
                name = torch.cuda.get_device_name(idx)
                cap = torch.cuda.get_device_capability(idx)
                mem = torch.cuda.get_device_properties(idx).total_memory / 1e9
                print(f"{name} (sm{cap[0]}{cap[1]}) {mem:.1f} GB")
    except Exception as e:
        print(f"Error occurred: {e}")
The script usually runs fine, but it fails roughly 10% of the time and throws the following:
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:789: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
As you can see, this warning is emitted before the script body runs, which suggests something is off with Docker or the host machine. Note that the same setup usually works and the failures appear random.
Exactly what the error is, is difficult to say. Looking at the codebase here (there is another occurrence as well) and at the docs for what nvmlInit returns, I cannot tell why it fails. I am also not sure whether this belongs here or in a GitHub issue.
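To get beyond the generic warning, my next step is to call NVML directly from inside the same container so the actual return code of nvmlInit is visible. Roughly like this (just a sketch; it assumes pynvml / nvidia-ml-py is installed in the image, which the runtime image may not include by default):

import pynvml  # pip install nvidia-ml-py

try:
    pynvml.nvmlInit()
    print("NVML initialized, driver:", pynvml.nvmlSystemGetDriverVersion())
    print("NVML device count:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    # err.value holds the raw NVML return code that the failing call reported
    print(f"NVML call failed: {err} (code {getattr(err, 'value', 'unknown')})")

If the reported code points at something like the driver not being loaded versus a driver/library version mismatch, that would at least narrow down where to look.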
I am asking for some guidance on this. We want stability on our VMs, and these sporadic initialization failures cause us trouble. Any help would be greatly appreciated!
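In the meantime, the only stopgap I can think of is to fail fast at the top of each job with a distinct exit code, so the container gets restarted instead of running on a broken GPU. I am not even sure the NVML warning always means CUDA itself is unusable, so this is only a sketch (the exit code is an arbitrary placeholder):

import sys
import torch

def assert_gpu_usable():
    # Fail fast if CUDA cannot be used in this container
    if not torch.cuda.is_available() or torch.cuda.device_count() == 0:
        print("GPU check failed in this container, exiting for a restart", file=sys.stderr)
        sys.exit(42)  # arbitrary exit code meaning "GPU not usable"
    # Allocating a tensor forces CUDA context creation and surfaces init errors early
    torch.zeros(1, device="cuda")

if __name__ == '__main__':
    assert_gpu_usable()
    # ... actual workload ...

If there is a better way to surface or handle the underlying NVML error, I would love to hear it.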
Cheers,
Tov