My second GPU, a Tesla V100-PCIE-32GB, disappears after running Transformers code or after some time in use

My second GPU disappears for no apparent reason, sometimes after running code and sometimes after the machine has been in use for a while.

My first GPU is a GP107GL [Quadro P1000] and the second one is a Tesla V100-PCIE-32GB. I am using the first one to drive my screens.

The GPU appears again if I restart the device.

My OS is Ubuntu 20.04.5 LTS

Is there any way to solve this problem?

The code I am running is below, in case it helps:

# %%
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# %%
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# %%
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# %%
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# %%

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

# %%
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# %%

# for batch in train_dataloader:
#     break
batch = next(iter(train_dataloader))
{k: v.shape for k, v in batch.items()}

# %%

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# %% 

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

# %%
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
# %%

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
# %%

import torch

print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])
# num_of_gpus = torch.cuda.device_count()
# print("The Number of the GPUs are: ", num_of_gpus)

# print("Current GPU", torch.cuda.current_device())

# torch.cuda.device(2)
# torch.cuda.set_device(0)
# print("New Selected GPU", torch.cuda.current_device())

# %%
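import os

# Note: CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is
# initialized in this process; the torch.cuda calls above may already have done that.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"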

# %%
torch.cuda.set_device(0)
# device = torch.device("cuda") if torch.cuda.is_available() else torch.device('cpu')
# print(device)
torch.cuda.get_arch_list()
torch.cuda.get_device_properties("cuda:0")
# torch.cuda.get_device_properties()
print("New Selected GPU", torch.cuda.current_device())

device = "cuda:0"
# %%

model.to(device)

# %%
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    print("Epoch: " , epoch)
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
## %%

# %%
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

The issue sounds more like a system, driver, or hardware issue, and I doubt it’s related to PyTorch.
I would thus recommend running a few simple CUDA tests to check whether the GPU also drops.
If so, check whether any Xids are shown in dmesg, which could indicate why it’s failing.
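
A minimal sketch of the dmesg check, assuming a Linux machine where the dmesg command is readable by the current user (on some systems it needs root). The helper names find_xid_events and read_dmesg are made up for illustration, and the regular expression is based on the usual "NVRM: Xid (PCI:...): <code>, ..." line layout rather than on any official parser, so treat it as a starting point.

```python
import re
import subprocess

# Matches kernel log lines such as:
#   [ 1361.908178] NVRM: Xid (PCI:0000:d8:00): 79, pid=0, GPU has fallen off the bus.
# The pattern is an assumption derived from that layout, not an official format spec.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, detail) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (
                    match.group("pci"),
                    int(match.group("code")),
                    match.group("detail").strip(),
                )
            )
    return events


def read_dmesg():
    """Return the kernel log as text (may require root, depending on system settings)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling find_xid_events(read_dmesg()) after the GPU has dropped should list any Xid codes the driver reported, which can then be looked up in NVIDIA's Xid documentation.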

It doesn’t produce any message; it just disappears.

Anyway, I ran the dmesg command in case you can get anything from it. Thanks a lot.

The output is here:

https://drive.google.com/file/d/1BGmCMz3oS-cM3W_kwb9KsoF9sn-U6wlI/view?usp=share_link

The error is:

[ 1361.908162] NVRM: GPU at PCI:0000:d8:00: GPU-d1a5f877-65cb-a62e-4192-ae05bb68fc48
[ 1361.908175] NVRM: GPU Board Serial Number: 1560121001476
[ 1361.908178] NVRM: Xid (PCI:0000:d8:00): 79, pid=0, GPU has fallen off the bus.
[ 1361.908186] NVRM: GPU 0000:d8:00.0: GPU has fallen off the bus.
[ 1361.908191] NVRM: GPU 0000:d8:00.0: GPU serial number is 1560121001476.
[ 1361.908210] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

Based on this table, it could be caused by any of the following:

  • HW error
  • Driver issue
  • System Memory Corruption
  • Bus Error
  • Thermal Issue

A while ago a user was seeing the same issue and realized that the power cable wasn’t properly plugged into the GPU, which caused the same Xid, so you might want to start with this.

I will try this and get back to you. Thanks a lot.

Hi @ptrblck

I tried the solution and made sure the power cable is plugged in properly, but the GPU is still disappearing.

Hi @ptrblck

Thanks for your help. Do you have any other tips for solving this problem? Thanks a lot.

No, unfortunately I don’t have any other advice besides trying to narrow down potential root causes for the aforementioned issues.

Hi,

Thanks a lot for your help and support. I tracked the issue down, and it seems to be caused by overheating.

How can I cool my GPU?

Your GPU should already have a fan so make sure to leave enough space for air circulation. I’m sure you can find a lot of helpful blog posts discussing good thermal performance in workstations, which cases are best, etc.

To be honest, it does not have a fan. Anyway, I have added two fans for cooling and it works like a monster.

Thanks a lot for your help and support.

I assume you did not use a server with active cooling but plugged the GPU into your workstation without any airflow?

Yes, this was a problem. It is a workstation.

OK, good you’ve isolated the issue.
By the way, I mistakenly assumed you were using a Titan V when I claimed it has a fan, but I now realize you are using a V100.

Thanks a lot for the help, it was very useful for me.

Hope it’s okay to bump this thread.

I have some additional details and the given solution doesn’t quite seem to be working for me.

Initially I thought this was an NVIDIA issue and opened a report on their forums (“Ubuntu 22.04 - GPU Falls off Bus - Unable to determine the device handle for GPU0000:01:00.0: Unknown Error” on the NVIDIA Developer Forums), but after debugging a bit I’m suspicious of PyTorch.

I don’t think it’s a thermal issue because the device temperatures don’t seem to go outside of the recommended operating envelope.
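
One way to back up a claim like this is to log temperature and power draw while the workload runs, so the last readings before a failure are on record. Below is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names (parse_smi_csv, query_gpus, log_readings) and the default interval are illustrative choices, not anything from this thread.

```python
import subprocess
import time

# Fields requested from nvidia-smi; the parser below expects exactly these four.
QUERY_FIELDS = "timestamp,index,temperature.gpu,power.draw"


def to_number(text):
    """Convert a field to float, or None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_smi_csv(output):
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    readings = []
    for line in output.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the expected layout
        timestamp, index, temp, power = parts
        readings.append(
            {
                "timestamp": timestamp,
                "gpu": int(index),
                "temp_c": to_number(temp),
                "power_w": to_number(power),
            }
        )
    return readings


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_smi_csv(result.stdout)


def log_readings(interval_s=5.0, samples=12):
    """Print readings every interval_s seconds; stop if nvidia-smi starts failing."""
    for _ in range(samples):
        try:
            for reading in query_gpus():
                print(reading)
        except subprocess.CalledProcessError as error:
            # nvidia-smi returned a non-zero exit code, e.g. because a GPU is unreachable.
            print("nvidia-smi failed:", error.stderr.strip())
            break
        time.sleep(interval_s)
```

Running log_readings in a second terminal during training and validation gives a timeline that can be compared against the Xid timestamp in dmesg.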

I don’t think it’s power related because I can run gpu_burn at full tilt for an hour and the issue I encounter pops up after just a few minutes (sometimes).

I can train a LoRA just fine overnight, but when I run validation in the middle of training, the card deadlocks with its fans at full tilt. I can’t kill Python, even via sudo kill -9 <pid>.
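
A process that ignores SIGKILL is often blocked in uninterruptible sleep (state D) inside a kernel or driver call; whether that is what happens here is an assumption. A minimal sketch for checking this on Linux by reading /proc/<pid>/stat follows. The function names are illustrative.

```python
def parse_proc_stat(stat_line):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split around the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    command = stat_line[open_idx + 1:close_idx]
    # The first field after the command is the one-letter process state,
    # e.g. R (running), S (sleeping), D (uninterruptible sleep), Z (zombie).
    state = stat_line[close_idx + 1:].split()[0]
    return pid, command, state


def process_state(pid):
    """Read and parse /proc/<pid>/stat for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return parse_proc_stat(handle.read())
```

If process_state(pid) reports D for the stuck Python process, the signal cannot be delivered until the blocking kernel call returns, which would be consistent with the driver waiting on an unreachable GPU.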

I thought it might be a driver issue, so I updated from 525 to 530. No luck there.

I thought it might be a PyTorch issue, so I upgraded from 1.12 to 1.13 to no avail, then to 2.0. Still no luck.

I can remotely attach to the process via GDB. I don’t have any debug symbols because it’s inside a proprietary CUDA driver. Here’s what I see:

(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x7f12a6f21b80 (LWP 4438) "python"         0x00007f1286cda79c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  2    Thread 0x7f122adff640 (LWP 4494) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58eae0 <thread_status+96>) at ./nptl/futex-internal.c:57
  3    Thread 0x7f122a5fe640 (LWP 4495) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58eb60 <thread_status+224>) at ./nptl/futex-internal.c:57
  4    Thread 0x7f1229dfd640 (LWP 4496) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ebe0 <thread_status+352>) at ./nptl/futex-internal.c:57
  5    Thread 0x7f12295fc640 (LWP 4497) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ec60 <thread_status+480>) at ./nptl/futex-internal.c:57
  6    Thread 0x7f1224dfb640 (LWP 4498) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ece0 <thread_status+608>) at ./nptl/futex-internal.c:57
  7    Thread 0x7f12205fa640 (LWP 4499) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ed60 <thread_status+736>) at ./nptl/futex-internal.c:57
  8    Thread 0x7f121ddf9640 (LWP 4500) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ede0 <thread_status+864>) at ./nptl/futex-internal.c:57
  9    Thread 0x7f120367e640 (LWP 4735) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  10   Thread 0x7f1202e7d640 (LWP 4736) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  11   Thread 0x7f120267c640 (LWP 4737) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  12   Thread 0x7f10e1172640 (LWP 4744) "cuda-EvtHandlr" 0x00007f12a7122d7f in __GI___poll (fds=0x55939d7964a0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
* 13   Thread 0x7f10e0971640 (LWP 4745) "cuda-EvtHandlr" 0x00007f1286cda799 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  14   Thread 0x7f10dbd9e640 (LWP 4746) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f10dbd9dde0, op=393, expected=0, futex_word=0x559389e86698) at ./nptl/futex-internal.c:57
  15   Thread 0x7f1201e3b640 (LWP 4747) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f1201e39ed0, op=393, expected=0, futex_word=0x7f10f0000fe0) at ./nptl/futex-internal.c:57
  16   Thread 0x7f120163a640 (LWP 4748) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f12016390b0, op=393, expected=0, futex_word=0x7f0fb8001200) at ./nptl/futex-internal.c:57
  17   Thread 0x7f1200d79640 (LWP 4749) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5594846c867c) at ./nptl/futex-internal.c:57
  18   Thread 0x7f11fbfff640 (LWP 4750) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f11fbffe0b0, op=393, expected=0, futex_word=0x7f0e90001200) at ./nptl/futex-internal.c:57

I can trace Python up and down the stack, but it seems like there’s some custom kernel in PyTorch that leads to everything waiting on a mutex.

Perhaps I have the order of causation wrong and it’s actually the card falling off the bus that happens first with the deadlock happening later? At the least, I think I have it narrowed down to a device issue and an issue in PyTorch.

Not sure how to debug further, though, to reiterate, I have succeeded in running this code end-to-end with validation turned off.

If the GPU falls off the bus, the code will get stuck somewhere, so I would be careful about claiming that a specific kernel causes the issue; it might just be the victim.

Yes, I would think so.

I doubt it’s an application issue (in this case PyTorch) unless you can provide a code snippet which would fail in another setup.

Appreciate the impressively fast reply.

I guess I’ll have to reach out to the card manufacturer. NVIDIA hasn’t yet replied in the other thread, but I’m going to see if the OEM has any input on stuff to do. Until then I’ll see if I can narrow it down or just turn off validation for a spell. My best guess is something related to context switching or loading/unloading models when switching from train to eval. I’ll follow up here if anything comes of it.

Thanks again.

I would also recommend checking the debug steps mentioned in this thread; in particular, check dmesg for Xids to see if you get a more specific error code.

Thank you. I already pulled dmesg and dumped it in the other thread yesterday. Aside from the stack trace and a notice about a driver crash there’s not much: [440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus. The CIFS malformed message is just me failing to mount network storage for training and the perf_interrupt is just auto-setting interrupt timing.

[    5.605621] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    5.652453] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[    5.663885] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[    5.681370] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    5.681372] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    5.912934] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    5.916542] nvidia-uvm: Loaded the UVM driver, major device number 507.
[    9.173026] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[    9.173101] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s31f6: link becomes ready
[   17.852287] loop5: detected capacity change from 0 to 8
[350775.385809] FS-Cache: Loaded
[350775.403050] FS-Cache: Netfs 'cifs' registered for caching
[350775.405237] Key type cifs.spnego registered
[350775.405245] Key type cifs.idmap registered
[350775.405500] Malformed UNC in devname
...
[350775.405519] CIFS: VFS: Malformed UNC in devname
[350920.653476] CIFS: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1) specify vers=1.0 on mount.
[350920.653485] CIFS: Attempting to mount \\synology\MLData
[350920.695369] CIFS: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[350920.695391] CIFS: VFS: \\synology Send error in SessSetup = -13
[351076.195760] CIFS: Attempting to mount \\synology\MLData
[352377.832353] CIFS: Attempting to mount \\synology\MLData
...
[362713.785429] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[398329.008318] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[398384.755321] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[440008.299189] NVRM: GPU at PCI:0000:01:00: GPU-508f8624-3013-b396-84aa-c207917faf36
[440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[440008.299194] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[440008.300548] NVRM: A GPU crash dump has been created. If possible, please run
                NVRM: nvidia-bug-report.sh as root to collect this data before
                NVRM: the NVIDIA kernel module is unloaded.
[440553.072866] sysrq: Show backtrace of all active CPUs
[440553.072884] NMI backtrace for cpu 1
[440553.072885] CPU: 1 PID: 107535 Comm: nvidia-bug-repo Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072887] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072888] Call Trace:
[440553.072889]  <TASK>
[440553.072890]  show_stack+0x52/0x5c
[440553.072894]  dump_stack_lvl+0x4a/0x63
[440553.072896]  dump_stack+0x10/0x16
[440553.072897]  nmi_cpu_backtrace.cold+0x4d/0x93
[440553.072899]  ? lapic_can_unplug_cpu+0x90/0x90
[440553.072902]  nmi_trigger_cpumask_backtrace+0xec/0x100
[440553.072905]  arch_trigger_cpumask_backtrace+0x19/0x20
[440553.072908]  sysrq_handle_showallcpus+0x17/0x20
[440553.072910]  __handle_sysrq.cold+0xc9/0x1a6
[440553.072912]  ? apparmor_file_permission+0x70/0x160
[440553.072914]  write_sysrq_trigger+0x28/0x40
[440553.072916]  proc_reg_write+0x5b/0xa0
[440553.072918]  ? __cond_resched+0x1a/0x50
[440553.072921]  vfs_write+0xc4/0x270
[440553.072923]  ksys_write+0x67/0xf0
[440553.072924]  __x64_sys_write+0x19/0x20
[440553.072926]  do_syscall_64+0x59/0xc0
[440553.072928]  ? syscall_exit_to_user_mode+0x27/0x50
[440553.072930]  ? __x64_sys_close+0x11/0x50
[440553.072932]  ? do_syscall_64+0x69/0xc0
[440553.072934]  ? __x64_sys_close+0x11/0x50
[440553.072935]  ? do_syscall_64+0x69/0xc0
[440553.072937]  ? irqentry_exit_to_user_mode+0x9/0x20
[440553.072939]  ? irqentry_exit+0x1d/0x30
[440553.072940]  ? exc_page_fault+0x89/0x170
[440553.072942]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[440553.072943] RIP: 0033:0x7f5bb4424a37
[440553.072946] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[440553.072947] RSP: 002b:00007ffcfe47d618 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[440553.072949] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5bb4424a37
[440553.072950] RDX: 0000000000000002 RSI: 000055e971460560 RDI: 0000000000000001
[440553.072951] RBP: 000055e971460560 R08: 000055e971456f02 R09: 0000000000000000
[440553.072952] R10: 000055e971456f01 R11: 0000000000000246 R12: 0000000000000001
[440553.072953] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[440553.072955]  </TASK>
[440553.072956] Sending NMI from CPU 1 to CPUs 0,2-7:
[440553.072960] NMI backtrace for cpu 5
[440553.072961] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072963] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072964] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.072966] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.072967] RSP: 0018:ffffa924400fbdf0 EFLAGS: 00000046
[440553.072968] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.072969] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.072970] RBP: ffffa924400fbe18 R08: 000190ae40932dc6 R09: 0000000000000000
[440553.072971] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.072971] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.072972] FS:  0000000000000000(0000) GS:ffff9ae2b6540000(0000) knlGS:0000000000000000
[440553.072974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.072975] CR2: 0000562e6e0374d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.072976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.072976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.072977] Call Trace:
[440553.072977]  <TASK>
[440553.072978]  ? intel_idle_ibrs+0x4d/0xd0
[440553.072980]  cpuidle_enter_state+0x97/0x620
[440553.072982]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.072984]  cpuidle_enter+0x2e/0x50
[440553.072985]  cpuidle_idle_call+0x142/0x1e0
[440553.072987]  do_idle+0x83/0xf0
[440553.072988]  cpu_startup_entry+0x20/0x30
[440553.072990]  start_secondary+0x12a/0x180
[440553.072992]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.072995]  </TASK>
[440553.072996] NMI backtrace for cpu 3
[440553.072997] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072999] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073000] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073002] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073004] RSP: 0018:ffffa924400ebdf0 EFLAGS: 00000046
[440553.073005] RAX: 0000000000000020 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073006] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000020
[440553.073007] RBP: ffffa924400ebe18 R08: 000190ae416871a1 R09: 00000000000c3500
[440553.073008] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000004
[440553.073008] R13: ffffffffaf4d49c0 R14: 0000000000000004 R15: ffffffffaf4d4b78
[440553.073009] FS:  0000000000000000(0000) GS:ffff9ae2b64c0000(0000) knlGS:0000000000000000
[440553.073010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073011] CR2: 00007f31bf953a70 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073013] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073014] Call Trace:
[440553.073014]  <TASK>
[440553.073015]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073017]  cpuidle_enter_state+0x97/0x620
[440553.073019]  cpuidle_enter+0x2e/0x50
[440553.073020]  cpuidle_idle_call+0x142/0x1e0
[440553.073022]  do_idle+0x83/0xf0
[440553.073024]  cpu_startup_entry+0x20/0x30
[440553.073025]  start_secondary+0x12a/0x180
[440553.073027]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073030]  </TASK>
[440553.073030] NMI backtrace for cpu 7
[440553.073031] CPU: 7 PID: 0 Comm: swapper/7 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073033] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073034] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073035] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073036] RSP: 0018:ffffa9244010bdf0 EFLAGS: 00000046
[440553.073037] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073038] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073039] RBP: ffffa9244010be18 R08: 000190ae414ab521 R09: 00000000000c3500
[440553.073039] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073040] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073041] FS:  0000000000000000(0000) GS:ffff9ae2b65c0000(0000) knlGS:0000000000000000
[440553.073042] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073042] CR2: 000055a281de74d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073044] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073044] Call Trace:
[440553.073045]  <TASK>
[440553.073045]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073047]  cpuidle_enter_state+0x97/0x620
[440553.073048]  cpuidle_enter+0x2e/0x50
[440553.073049]  cpuidle_idle_call+0x142/0x1e0
[440553.073051]  do_idle+0x83/0xf0
[440553.073052]  cpu_startup_entry+0x20/0x30
[440553.073053]  start_secondary+0x12a/0x180
[440553.073055]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073057]  </TASK>
[440553.073058] NMI backtrace for cpu 0
[440553.073059] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073061] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073061] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073064] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073065] RSP: 0018:ffffffffaf203d88 EFLAGS: 00000046
[440553.073066] RAX: 0000000000000010 RBX: 0000000000000003 RCX: 0000000000000001
[440553.073067] RDX: 0000000000000000 RSI: ffffffffaf4d49c0 RDI: 0000000000000010
[440553.073068] RBP: ffffffffaf203da8 R08: 000190ae41667cf8 R09: 0000000000030d40
[440553.073069] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000003
[440553.073070] R13: ffffffffaf4d49c0 R14: 0000000000000003 R15: ffffffffaf4d4b10
[440553.073071] FS:  0000000000000000(0000) GS:ffff9ae2b6400000(0000) knlGS:0000000000000000
[440553.073072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073073] CR2: 00005619c6c68000 CR3: 0000000324610002 CR4: 00000000003706f0
[440553.073074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073074] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073075] Call Trace:
[440553.073076]  <TASK>
[440553.073076]  ? intel_idle+0x30/0x50
[440553.073078]  cpuidle_enter_state+0x97/0x620
[440553.073081]  cpuidle_enter+0x2e/0x50
[440553.073082]  cpuidle_idle_call+0x142/0x1e0
[440553.073084]  do_idle+0x83/0xf0
[440553.073085]  cpu_startup_entry+0x20/0x30
[440553.073086]  rest_init+0xd3/0x100
[440553.073088]  ? acpi_enable_subsystem+0x20b/0x217
[440553.073090]  arch_call_rest_init+0xe/0x23
[440553.073092]  start_kernel+0x4a9/0x4ca
[440553.073094]  x86_64_start_reservations+0x24/0x2a
[440553.073095]  x86_64_start_kernel+0xfb/0x106
[440553.073097]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073100]  </TASK>
[440553.073100] NMI backtrace for cpu 4
[440553.073101] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073103] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073103] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073105] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073106] RSP: 0018:ffffa924400f3df0 EFLAGS: 00000046
[440553.073107] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073108] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073108] RBP: ffffa924400f3e18 R08: 000190ae414a42b3 R09: 0000000000000000
[440553.073109] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073110] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073111] FS:  0000000000000000(0000) GS:ffff9ae2b6500000(0000) knlGS:0000000000000000
[440553.073111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073112] CR2: 00007f2dbc64ea50 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073114] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073114] Call Trace:
[440553.073115]  <TASK>
[440553.073115]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073117]  cpuidle_enter_state+0x97/0x620
[440553.073118]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073119]  cpuidle_enter+0x2e/0x50
[440553.073120]  cpuidle_idle_call+0x142/0x1e0
[440553.073122]  do_idle+0x83/0xf0
[440553.073123]  cpu_startup_entry+0x20/0x30
[440553.073125]  start_secondary+0x12a/0x180
[440553.073126]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073129]  </TASK>
[440553.073129] NMI backtrace for cpu 2
[440553.073131] CPU: 2 PID: 106415 Comm: python Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073133] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073133] RIP: 0010:entry_SYSCALL_64_after_hwframe+0x57/0xcb
[440553.073136] Code: 45 31 e4 45 31 ed 45 31 f6 45 31 ff 48 89 e7 48 63 f0 66 90 b9 48 00 00 00 65 48 8b 14 25 c8 fb 01 00 89 d0 48 c1 ea 20 0f 30 <0f> 1f 44 00 00 e8 07 3a fa ff 0f 1f 44 00 00 48 8b 4c 24 58 4c 8b
[440553.073137] RSP: 0018:ffffa92441bb3f58 EFLAGS: 00000046
[440553.073139] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000048
[440553.073140] RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffffa92441bb3f58
[440553.073141] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[440553.073141] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[440553.073142] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[440553.073143] FS:  00007fb025f0ab80(0000) GS:ffff9ae2b6480000(0000) knlGS:0000000000000000
[440553.073144] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073145] CR2: 000056038f0b0cc0 CR3: 0000000808942004 CR4: 00000000003706e0
[440553.073146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073146] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073147] Call Trace:
[440553.073148]  <TASK>
[440553.073150]  </TASK>
[440553.073150] NMI backtrace for cpu 6
[440553.073151] CPU: 6 PID: 0 Comm: swapper/6 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073153] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073154] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073158] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073159] RSP: 0018:ffffa92440103df0 EFLAGS: 00000046
[440553.073161] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073162] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073163] RBP: ffffa92440103e18 R08: 000190ae410d4c9a R09: 0000000000000000
[440553.073165] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073166] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073167] FS:  0000000000000000(0000) GS:ffff9ae2b6580000(0000) knlGS:0000000000000000
[440553.073169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073170] CR2: 00007f5bb43a2db0 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073174] Call Trace:
[440553.073175]  <TASK>
[440553.073176]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073179]  cpuidle_enter_state+0x97/0x620
[440553.073181]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073184]  cpuidle_enter+0x2e/0x50
[440553.073186]  cpuidle_idle_call+0x142/0x1e0
[440553.073189]  do_idle+0x83/0xf0
[440553.073191]  cpu_startup_entry+0x20/0x30
[440553.073193]  start_secondary+0x12a/0x180
[440553.073196]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073200]  </TASK>
[440556.401698] snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[440556.778840] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x4f0800. -5
[440556.778860] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[440556.778862] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1

EDIT:

I was able to complete another full run of GPU burn overnight (this time with floats).

Using compare file: compare.ptx
Burning for 25200 seconds.
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-508f8624-3013-b396-84aa-c207917faf36)
Initialized device 0 with 24257 MB of memory (23645 MB available, using 21281 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 81 iterations
10.0%  proc'd: 38475 (16187 Gflop/s)   errors: 0   temps: 79 C
        Summary at:   Sat Apr 29 07:10:22 AM UTC 2023
...
100.0%  proc'd: 378027 (16803 Gflop/s)   errors: 0   temps: 79 C
Killing processes with SIGTERM (soft kill)
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK                                                                                    

I happened to notice this test isn’t using tensor cores, though! So that will be my next check.