Can't initialize NVML: Ambiguous response

Hi all,

We have been running a lot of VMs with containerized workloads using Docker, with the base image:
pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
and the script:

import subprocess
import torch


if __name__ == '__main__':
    print('Testing NVIDIA SMI')
    try:
        # Run the nvidia-smi command
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True, check=True)
        # Print the output
        print("Result", result)
        print("Result stdout: ", result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"subprocess.CalledProcessError occurred: {e}")
    print('Testing CUDA')
    try:
        print("CUDA CHECK ---------------------")
        print('CUDA available running "torch.cuda.is_available()":', torch.cuda.is_available())
        print('CUDA device count:"torch.cuda.device_count()"', torch.cuda.device_count())
        print('CUDA device name:', torch.cuda.get_device_name(0))
        if torch.cuda.is_available():
            for idx in range(torch.cuda.device_count()):
                name = torch.cuda.get_device_name(idx)
                cap = torch.cuda.get_device_capability(idx)
                mem = torch.cuda.get_device_properties(idx).total_memory/1e9
            print(f"{name}  (sm{cap[0]}{cap[1]})  {mem:.1f} GB")
    except Exception as e:
        print(f"Error occurred: {e}")

and the script runs are mostly successful, but they sometimes fail. It fails about 10% of the time and throws the following:

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:789: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

As you can also see, this is thrown before the script is run, which suggests something is off with Docker or the host machine. Note that this usually runs fine and the failure happens randomly.

Exactly what the error is is difficult to say. Looking at the codebase here (there is another occurrence as well) and the docs for what the nvmlInit function returns, I cannot tell why it fails. Whether I should post this here or open a GitHub issue, I do not know.
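
One way to surface the concrete NVML return code, instead of PyTorch's generic warning, would be to call NVML directly through the nvidia-ml-py bindings; the following is just a sketch, assuming that package is installed in the container:

import pynvml  # provided by the nvidia-ml-py package (an extra dependency, not part of the base image)

try:
    pynvml.nvmlInit()
    print("NVML driver version:", pynvml.nvmlSystemGetDriverVersion())
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        print(i, pynvml.nvmlDeviceGetName(handle))
except pynvml.NVMLError as err:
    # err.value should hold the raw NVML return code (e.g. driver-not-loaded vs. unknown error)
    print("NVML call failed:", err, "code:", getattr(err, "value", None))
finally:
    try:
        pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        pass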

I am asking for some guidance on this, though. We want stability on our VMs, but these sporadic initialization failures cause us trouble. Any help would be greatly appreciated!

Cheers,
Tov

I have run some more tests which seem to indicate that something strange is going on with PyTorch. This is how the output looks after running the script on a node (cleaned up a bit for readability):

Testing NVIDIA SMI
Result stdout:  Thu Jul 17 11:38:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000001:00:00.0 Off |                    0 |
| N/A   34C    P0             62W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 NVL                On  |   00000002:00:00.0 Off |                    0 |
| N/A   33C    P0             62W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Testing CUDA
CUDA CHECK ---------------------
CUDA available running "torch.cuda.is_available()": True
CUDA device count:"torch.cuda.device_count()" 2
CUDA device name: NVIDIA H100 NVL
NVIDIA H100 NVL  (sm90)  100.0 GB

and this is the output on an identical (but not the same) VM:

Testing NVIDIA SMI
Result stdout:  Thu Jul 17 11:39:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000001:00:00.0 Off |                    0 |
| N/A   35C    P0             63W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 NVL                On  |   00000002:00:00.0 Off |                    0 |
| N/A   34C    P0             62W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Testing CUDA
CUDA CHECK ---------------------
CUDA available running "torch.cuda.is_available()": False
CUDA device count:"torch.cuda.device_count()" 0
Error occurred: No CUDA GPUs are available

As far as I know, this does not make any sense. Do I have to explicitly set something to make sure that PyTorch knows about the GPUs?
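
(For reference, a quick sanity check of the GPU-related environment variables inside the container could look like the sketch below; the variable names assume the standard NVIDIA container toolkit setup.)

import os

# Print the GPU-related environment variables inside the container.
# An unset CUDA_VISIBLE_DEVICES is the normal state; an empty string would hide all GPUs.
for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")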

No, PyTorch will use the CUDA runtime to detect GPUs and is most likely not causing the issues you are seeing. You could run any standalone CUDA application (e.g. from the CUDA samples or any custom code) on the failing node to further isolate why GPUs cannot be detected anymore.
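
If building the samples is inconvenient, a rough Python-only alternative is to query the CUDA driver API directly through ctypes; a minimal sketch, assuming libcuda.so.1 is visible inside the container:

import ctypes

# Load the CUDA driver library and call cuInit/cuDeviceGetCount directly,
# bypassing PyTorch entirely, so the raw driver return codes are visible.
libcuda = ctypes.CDLL("libcuda.so.1")

res = libcuda.cuInit(0)
name = ctypes.c_char_p()
libcuda.cuGetErrorName(res, ctypes.byref(name))
print("cuInit:", res, name.value)  # 0 / CUDA_SUCCESS means the driver initialized

count = ctypes.c_int(0)
res = libcuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount:", res, "devices:", count.value)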

Thanks for replying. Since the nvidia-smi command is able to detect the GPUs, does that imply that CUDA should also be able to detect the GPUs? That is, is something going wrong with the CUDA toolkit/installation, or is it in any case a CUDA issue?

No, it doesn’t, since the driver could still run into issues which would only be raised at runtime.

OK, so the NVIDIA drivers cannot be so easily disregarded as the potential issue.
Could you by chance point me to some sample code that could be used to debug the issue?

I still think that the parsing of the potential error could be improved for the consumer (me), as I do not have a really good idea of how to approach that. The reason I am saying this is that I have a hunch it might be a re-initialization error, but I have no way of telling that from the response without doing a lot of work, even though it could be presented.

Any example from cuda-samples should work as another test assuming you are using a locally installed CUDA toolkit and are able to build these tests.

Yes, thanks for the feedback as we should check if the error message can be improved.

Ok, I will perform some testing. Thanks for the help!

Hi @ptrblck ,

I have a question regarding the implementation in torch: why does it never call nvmlShutdown()? Is it up to the user to properly shut it down after it is done communicating with the GPUs?

Edit: I think I understand that short-lived tasks do not need to call nvmlShutdown, as a fresh instance of the host will probably be present. In our case though, we most likely will not have a host that has most of its services reset. Since PyTorch does not really release the NVML resources it has requested when it runs, could that be of importance for understanding our error?
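
(For our own monitoring code, I assume we could pair the calls explicitly ourselves; a sketch using the nvidia-ml-py bindings, separate from whatever PyTorch does internally:)

from contextlib import contextmanager

import pynvml  # nvidia-ml-py bindings, an extra dependency in our image

@contextmanager
def nvml_session():
    # Pair nvmlInit and nvmlShutdown so our own NVML usage never leaks the handle.
    pynvml.nvmlInit()
    try:
        yield
    finally:
        pynvml.nvmlShutdown()

with nvml_session():
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(i, pynvml.nvmlDeviceGetName(handle), f"{mem.used}/{mem.total} bytes used")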

Once the Python process terminates, the CUDA runtime will shut down all resources, including NVML as well as any CUDA math lib handles. PyTorch itself does not provide interfaces to shut down the CUDA runtime or reset the CUDA context etc. manually.

I don’t understand this description, e.g. what “fresh instance of the host” means. No, I don’t think any of our NVML usage is related to the issues you are seeing, as I assume plain CUDA samples will also break, which you did not confirm yet.

I did not know that the CUDA runtime behaves in that manner, thanks for clarifying that. I knew that PyTorch does not provide interfaces for that, which is what led to my question.

I was about to try out the CUDA samples, but it has not been easy to automate. The reason I want to automate it is that the CUDA runtime only fails 1 out of 10 times and our nodes are quickly deallocated to save resources. But I will get around to testing it; it seems like a really good test to perform.
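
Something like the following could drive it repeatedly before the node gets deallocated (just a sketch; check_gpu.py is a placeholder name for the test script from the first post):

import subprocess
import sys

# Run the GPU check in a fresh Python process many times and count failures,
# since the problem only shows up in roughly 1 out of 10 runs.
RUNS = 50
failures = 0

for i in range(RUNS):
    proc = subprocess.run([sys.executable, "check_gpu.py"], capture_output=True, text=True)
    failed = proc.returncode != 0 or "No CUDA GPUs are available" in proc.stdout
    if failed:
        failures += 1
        print(f"run {i} failed:\n{proc.stdout}\n{proc.stderr}")

print(f"{failures}/{RUNS} runs failed")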

@ptrblck

Hi again,

I have now run the samples on the machine. Before testing I verified whether the node could run the usual torch commands it fails on, and in this case the machine can run `torch.cuda.is_available()` successfully.

Regardless, building the samples gives the following output:

`nvcc fatal : Unsupported gpu architecture 'compute_110'`

multiple times. The build also fails later on due to some invalid conversion being performed, which does not look promising. This does not really belong in this forum, but while we are at it, in case you have more experience with this:

I have run nvcc -V:
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

and in /usr/local/cuda-12.8/version.json:

"cuda" : {
"name" : "CUDA SDK",
"version" : "12.8.1"
},

and

"cuda_nvcc" : {
"name" : "CUDA NVCC",
"version" : "12.8.93"
},

Also, during one of the tests in the samples (I ran 1_Utilities in isolation), I got the following output:

Device 3: "Tesla T4"
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 7.5

Now, searching the web, CUDA 12.8 should support Turing GPUs. But I do not know why it is looking for compute_110, nor whether it is the CUDA toolkit or the GPUs that are failing. I have looked at all the /usr/local/cuda../ folders and they all point to the same versions.

The only difference I can notice is that /usr/local/cuda../ has:

"nvidia_driver" : {
"name" : "NVIDIA Linux Driver",
"version" : "570.124.06"
},

and nvidia-smi has:

NVIDIA-SMI 570.133.20 Driver Version: 570.133.20

If you have any insight, it will be greatly appreciated.

This error is unrelated to the one you are seeing in this thread (losing access to your GPUs) and is caused by a mismatch between the GPU architectures the CUDA samples are trying to build and your locally installed CUDA toolkit, which is too old and does not support sm_110 yet (CUDA 13+ is needed).

As a simple workaround, remove 110 (and any other unwanted architectures) from here.
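
To see which architecture values you actually need on a given node, you could print the compute capability of each visible GPU, e.g.:

import torch

# Print the compute capability of each visible GPU so only the needed
# architectures are kept in the samples' build configuration
# (e.g. 9.0 -> sm_90 for H100, 7.5 -> sm_75 for T4).
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    print(f"GPU {idx}: {torch.cuda.get_device_name(idx)} -> sm_{major}{minor}")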

I removed the unwanted architectures from the CMakeLists.txt files in the projects and built using:

$ make -j$(nproc) --ignore-errors

since there were a lot of errors, type mismatches and such, and after that I ran:

python3 run_tests.py --parallel 4

which found around 77 executables I think, and these were the ones that failed:
Failed runs (26):
update.sample : Failed (code 1)
pre-merge-commit.sample : Failed (code 2)
push-to-checkout.sample : Failed (code 1)
pre-applypatch.sample : Failed (code 2)
applypatch-msg.sample : Failed (code 2)
CMakeDetermineCompilerABI_C.bin : Failed (code 49)
CMakeDetermineCompilerABI_CXX.bin : Failed (code 49)
a.out : Failed (code 212)
a.out : Failed (code 212)
a.out : Failed (code 190)
fsmonitor-watchman.sample : Failed (code 255)
streamOrderedAllocationP2P : Failed (code 2)
ptxgen : Failed (code 16)
cuda_f_1.yuv : Error: [Errno 8] Exec format error: './cuda_f_1.yuv' (code -1)
cuda_yuv_f_1.yuv : Error: [Errno 8] Exec format error: './cuda_yuv_f_1.yuv' (code -1)
cuda_yuv_f_2.yuv : Error: [Errno 8] Exec format error: './cuda_yuv_f_2.yuv' (code -1)
CMakeDetermineCompilerABI_C.bin : Failed (code 49)
CMakeDetermineCompilerABI_CXX.bin : Failed (code 49)
a.out : Failed (code 212)
a.out : Failed (code 212)
a.out : Failed (code 190)
build.bat : Error: [Errno 8] Exec format error: './build.bat' (code -1)
build.sh : Failed (code 2)
shaders.hlsl : Error: [Errno 8] Exec format error: './shaders.hlsl' (code -1)
stdafx.cpp : Error: [Errno 8] Exec format error: './stdafx.cpp' (code -1)
pre-push.sample : Timeout (code -1)

I do not know what to make of this. Since things are failing during building, it is unclear to me what is supposed to happen. Mind you, I do not have experience with C++, so the depth of the errors is unclear to me. Also, looking at the APM_ output for the ones that failed, there is nothing that really stands out; most are empty.