[CRASH!] System crash/reboot on RTX 4090

Hi all,
I am making this post after two weeks of banging my head against this problem; I have exhausted everything I can think of.

GPU: ASUS TUF Gaming RTX 4090 OC edition
OS: Ubuntu 22.04
Processor: 2x Intel Xeon 4214R CPUs
Power supply: 1650 W (well above the 450 W the 4090 requires)
RAM: 100 GB

For some reason, running PyTorch code on it crashes the whole GPU and causes the machine to reboot. More specifically, training the official DETR and DINO-DETR repositories locally is enough to crash the machine.

Here are the debugging steps I have tried:
[1] Ran a simple PyTorch script:

import torch
import torch.nn as nn
import torch.optim as optim

# large random input that keeps the GPU busy
x = torch.randn((1000000, 700)).cuda()
print("shape", x.shape)


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = nn.Linear(700, 1000)
        self.l2 = nn.Linear(1000, 700)

    def forward(self, x):
        x = self.l(x)
        x = self.l2(x)
        return x


m = Model().cuda()
loss = nn.MSELoss()
adam = optim.Adam(m.parameters(), lr=0.001)

# endless forward/backward/step loop to stress the GPU
i = 0
while True:
    print(i)
    i += 1
    adam.zero_grad()
    out = m(x)
    l = loss(out, out.detach() + 1e-7)  # detach the target so gradients only flow through the model output
    l.backward()
    adam.step()
This achieves 100% GPU utilization and reaches peak temperatures. This DOES NOT crash.

[2] Ran gpu-burn and memtest for over 12 hours → all passed, which should rule out power supply/RAM issues.
[3] However, running the official DETR/DINO-DETR training code crashes the machine frequently; DINO-DETR can crash within ~40 training iterations.
[4] The machine works fine with cards from other generations, such as Quadro and Pascal GPUs.

Now, this has led me to believe that this is a software issue and not a hardware one, since simple benchmarking works.
But my expertise stops at figuring out which precise call reproduces the issue, so I would be grateful if someone could give me some suggestions:
[1] Is it an issue with my machine?
[2] Is it an NVIDIA driver issue or a PyTorch issue? I have tried both the latest dev (430) and stable (425) drivers, and also CUDA 11, 11.7, and 12. All show the same issue.
[3] Or could it be a faulty GPU?
[4] Are other people also facing such trouble?
[5] I don't think it is a temperature issue, since most of the time the crashes happen even when the temperature is in range (see the small logging sketch below).
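To watch the temperature, I would run something like the following in a second terminal while training (a minimal sketch, assuming the pynvml bindings, installable as nvidia-ml-py, are available; GPU index 0 and the 1-second interval are placeholders):

import time
import pynvml

# Log GPU temperature and power draw once per second (assumes `pip install nvidia-ml-py`).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        print(f"temp: {temp} C, power: {power_w:.1f} W")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()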
@ptrblck and everyone, I would be grateful for your guidance. Can you think of anything that might be the cause? Something (some API) somewhere is doing something it shouldn't :frowning: , but I can't figure out what.
Thanks,
Rajat

Note that “100% GPU utilization” doesn’t necessarily imply maximum power/temperature stress so I’m not sure a thermal, power, (or other hardware-related) issue can be ruled out from your test script. Could you try to isolate a runnable portion of the detr/dino training code (e.g., using dummy random tensors instead of the actual data is fine) that reproduces the issue?
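For example, something along these lines could serve as a starting point (a rough sketch only, not the official DETR code: a torchvision ResNet-50 stands in for the backbone, an nn.Transformer for the detection part, and the batch size, d_model=256, 100 queries, and 91 classes are arbitrary placeholders):

import torch
import torch.nn as nn
import torchvision

device = "cuda"

# DETR-like stress test on random data with mixed precision (all sizes are placeholders).
backbone = torchvision.models.resnet50(weights=None).to(device)
backbone.fc = nn.Linear(2048, 256).to(device)             # project features to d_model
transformer = nn.Transformer(d_model=256, batch_first=True).to(device)
head = nn.Linear(256, 91).to(device)                      # dummy classification head

params = list(backbone.parameters()) + list(transformer.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for it in range(1000):
    imgs = torch.randn(8, 3, 224, 224, device=device)     # dummy images
    queries = torch.randn(8, 100, 256, device=device)     # dummy object queries
    labels = torch.randint(0, 91, (8, 100), device=device)

    opt.zero_grad()
    with torch.cuda.amp.autocast():
        memory = backbone(imgs).unsqueeze(1)               # (8, 1, 256)
        out = transformer(memory, queries)                 # (8, 100, 256)
        loss = nn.functional.cross_entropy(head(out).flatten(0, 1), labels.flatten())
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    print(it, loss.item())

If something like this also reproduces the reboot, it would be much easier for others to test.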

In most cases, we would not expect a software issue such as a misbehaving kernel to trigger a whole system crash/reboot.
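As a toy illustration of what a software-level failure usually looks like instead (an out-of-bounds index inside a CUDA kernel surfaces as a device-side assert / RuntimeError in the Python process, not as a reboot):

import torch

x = torch.randn(10, device="cuda")
bad_idx = torch.tensor([100], device="cuda")   # out of range on purpose

try:
    y = x[bad_idx]
    torch.cuda.synchronize()                   # surfaces the asynchronous device-side assert
except RuntimeError as e:
    print("caught:", e)                        # the process keeps running; no reboot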

I can't do that, since it runs fine for some training iterations and not for others. If it were some memory access issue, it should have crashed in the first iteration already…

Is there a far simpler piece of code I could run to rule out hardware issues?

The reboot points towards a weak or defective PSU and I would recommend swapping it out for a new one, even if only temporarily, to check if this helps.

I won’t post the same on the other issues where you have tagged me and assume you will keep the GitHub issues updated.


I would power limit the 4090 to 250-300 W in case the GPU or PSU crashes under stress. You should also manually set the fan speed and make sure the CPU and GPU are not running at high temperatures. I run mine at around 250 W with the fan speed at 40-45%, which keeps the GPU temperature below 60 °C and the CPU below 70 °C. I chose this setting because it runs almost silently. I also run it with the side panel open to reduce the temperature.
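For reference, the power limit itself can be set like this (a minimal sketch; it assumes nvidia-smi is on the PATH and the script is run with root privileges, with GPU index 0 and a 250 W cap as placeholders):

import subprocess

# Cap GPU 0 at 250 W (requires root), then print the power section to verify.
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "250"], check=True)
subprocess.run(["nvidia-smi", "-i", "0", "-q", "-d", "POWER"], check=True)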

Hi Rajat, did you solve this?

@eqy @ptrblck @jkdev I was never able to solve it. I checked on two machines with two separate power supplies, did an RMA with ASUS twice, and have now tried 3 separate RTX 4090 variants from the same company. Still the same issue.

gpu-burn and all the other tests pass.
I am clueless now, and multiple people have reached out to me with the same problem.
I would be grateful for your suggestions; if anyone can solve it, it's you guys.

You could check whether dmesg or other system logging utilities would give you more information about why your system dies. I'm unaware of any 4090-specific issues, and similar errors reported here were due to the PSU.
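For example, after the next reboot you could dump the kernel log of the previous boot and look for NVRM/Xid messages from the NVIDIA driver (a rough sketch; it assumes systemd's journalctl is available and persistent journaling is enabled, otherwise messages from the boot that crashed may not be kept):

import subprocess

# Kernel messages (-k) from the previous boot (-b -1), without a pager.
log = subprocess.run(
    ["journalctl", "-k", "-b", "-1", "--no-pager"],
    capture_output=True, text=True,
).stdout

for line in log.splitlines():
    if "NVRM" in line or "Xid" in line:   # NVIDIA driver errors are logged with these tags
        print(line)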

I had the same issue, but solved it with the following method:

  1. Install the CUDA 12.2.2 toolkit with the 535.104.05 driver
  2. Install the latest PyTorch nightly (‘2.1.0.dev20230831+cu121’)

I hope this solves your issue.

@winpih Thank you for your suggestion.

May I ask how you installed those specific CUDA and driver versions on Linux?