I have a very strange error. I'm working on a server with two GPUs. I've installed Python 3.10.11 and a torch build for CUDA 12.1.
nvcc --version
gives me:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
However, nvidia-smi
gives: NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4
But anyway, CUDA is available. I tried to test it with this simple script:
import os

import torch

# These are read when the CUDA context is created, so set them before any CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # synchronous kernel launches for clearer errors
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # use the second GPU only

print("CUDA available:", torch.cuda.is_available())
print("PyTorch version:", torch.__version__)
print("CUDA version (PyTorch):", torch.version.cuda)

if torch.cuda.is_available():
    device = torch.device("cuda")

    # Report the memory state of the selected device.
    properties = torch.cuda.get_device_properties(device)
    total_memory = properties.total_memory
    allocated_memory = torch.cuda.memory_allocated(device)
    reserved_memory = torch.cuda.memory_reserved(device)
    free_memory = total_memory - reserved_memory

    print(f"Device Name: {properties.name}")
    print(f"Total memory: {total_memory / 1024**3:.2f} GB")
    print(f"Allocated memory: {allocated_memory / 1024**3:.2f} GB")
    print(f"Reserved memory: {reserved_memory / 1024**3:.2f} GB")
    print(f"Free memory: {free_memory / 1024**3:.2f} GB")

    # Minimal sanity check: allocate two small tensors on the GPU and add them.
    a = torch.rand(2, 2, device=device)
    b = torch.rand(2, 2, device=device)
    result = a + b
    print("Tensor a:\n", a)
    print("Tensor b:\n", b)
    print("Result of a + b:\n", result)
else:
    print("CUDA is not available.")
And I get the following result:
CUDA available: True
PyTorch version: 2.5.1+cu124
CUDA version (PyTorch): 12.4
Device Name: NVIDIA H100 PCIe
Total memory: 79.10 GB
Allocated memory: 0.00 GB
Reserved memory: 0.00 GB
Free memory: 79.10 GB
Traceback (most recent call last):
  File "/home/mnlubov/test/test.py", line 31, in <module>
    a = torch.rand(2, 2, device=device)
RuntimeError: CUDA error: uncorrectable ECC error encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I can't understand why I'm getting this error: CUDA error: uncorrectable ECC error encountered.
I've checked the ECC counters and everything looks fine:
Fri Nov 1 15:17:30 2024
Driver Version : 550.127.05
CUDA Version : 12.4
Attached GPUs : 2

GPU 00000000:81:00.0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
The output is the same for the second GPU, so it seems there is no problem with the GPU itself.
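For reference, the same uncorrected-error counters can also be read programmatically through NVML, which is the interface nvidia-smi itself queries. This is only a minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed:

```python
# Minimal sketch: read the uncorrected ECC counters through NVML
# (assumes `pip install nvidia-ml-py`).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Uncorrected errors since the last driver reload (volatile)
        # and over the lifetime of the board (aggregate).
        volatile = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        aggregate = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        print(f"GPU {i}: volatile uncorrected={volatile}, aggregate uncorrected={aggregate}")
finally:
    pynvml.nvmlShutdown()
```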
What is the reason, and how can I fix it?
Is it due to an incorrect torch version?