I have a very strange error. I'm working on a server with two GPUs. I've installed Python 3.10.11 and a torch build for CUDA 12.1.
nvcc --version
gives me:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
However, nvidia-smi
gives: NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4
But anyway, CUDA is available. I tried to test it with this simple script:
import os

import torch

# These are read when the CUDA context is created, so set them before any CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # synchronous kernel launches for clearer errors
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # use the second GPU only

print("CUDA available:", torch.cuda.is_available())
print("PyTorch version:", torch.__version__)
print("CUDA version (PyTorch):", torch.version.cuda)

if torch.cuda.is_available():
    device = torch.device("cuda")

    # Report the memory state of the selected device.
    properties = torch.cuda.get_device_properties(device)
    total_memory = properties.total_memory
    allocated_memory = torch.cuda.memory_allocated(device)
    reserved_memory = torch.cuda.memory_reserved(device)
    free_memory = total_memory - reserved_memory

    print(f"Device Name: {properties.name}")
    print(f"Total memory: {total_memory / 1024**3:.2f} GB")
    print(f"Allocated memory: {allocated_memory / 1024**3:.2f} GB")
    print(f"Reserved memory: {reserved_memory / 1024**3:.2f} GB")
    print(f"Free memory: {free_memory / 1024**3:.2f} GB")

    # Minimal sanity check: allocate two small tensors on the GPU and add them.
    a = torch.rand(2, 2, device=device)
    b = torch.rand(2, 2, device=device)
    result = a + b
    print("Tensor a:\n", a)
    print("Tensor b:\n", b)
    print("Result of a + b:\n", result)
else:
    print("CUDA is not available.")
And I get the following result:
CUDA available: True
PyTorch version: 2.5.1+cu124
CUDA version (PyTorch): 12.4
Device Name: NVIDIA H100 PCIe
Total memory: 79.10 GB
Allocated memory: 0.00 GB
Reserved memory: 0.00 GB
Free memory: 79.10 GB
Traceback (most recent call last):
  File "/home/mnlubov/test/test.py", line 31, in <module>
    a = torch.rand(2, 2, device=device)
RuntimeError: CUDA error: uncorrectable ECC error encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I can't understand why I'm getting this error: CUDA error: uncorrectable ECC error encountered.
I've checked the ECC counters and everything looks fine:
Fri Nov 1 15:17:30 2024
Driver Version : 550.127.05
CUDA Version : 12.4
Attached GPUs : 2

GPU 00000000:81:00.0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
The output is the same for the second GPU, so it seems there is no problem with the GPU itself.
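For reference, the same uncorrected-error counters can also be read programmatically through NVML, which is the interface nvidia-smi itself queries. This is only a minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed:

```python
# Minimal sketch: read the uncorrected ECC counters through NVML
# (assumes `pip install nvidia-ml-py`).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Uncorrected errors since the last driver reload (volatile)
        # and over the lifetime of the board (aggregate).
        volatile = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        aggregate = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        print(f"GPU {i}: volatile uncorrected={volatile}, aggregate uncorrected={aggregate}")
finally:
    pynvml.nvmlShutdown()
```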
What is the reason, and how can I fix it?
Is it due to an incorrect torch version?