Hi all,
I am making this post after banging my head against this problem for the past 2 weeks of hell.
My small brain has exhausted everything it can think of.
My GPU is an ASUS TUF Gaming RTX 4090 OC Edition.
OS: Ubuntu 22.04.
Processor: 2x Intel Xeon 4214R CPU.
Power supply: 1650 W, well above the 450 W that the 4090 requires.
RAM: 100 GB.
For some reason, running PyTorch code on it crashes the whole GPU and causes the machine to reboot. More specifically, training the official DETR and DINO DETR repositories locally is enough to crash the machine.
Here are the debugging steps I have tried:
[1] Ran a simple PyTorch script:
import torch
import torch.nn as nn
import torch.optim as optim

x = torch.randn((1000000, 700)).cuda()
print("shape", x.shape)

class model(torch.nn.Module):
    def __init__(self):
        super(model, self).__init__()
        self.l = nn.Linear(700, 1000)
        self.l2 = nn.Linear(1000, 700)

    def forward(self, x):
        x = self.l(x)
        x = self.l2(x)
        return x

m = model().cuda()
loss = nn.MSELoss()
adam = optim.Adam(m.parameters(), lr=0.001)
i = 0
while 1:
    print(i)
    i += 1
    out = m(x)
    l = loss(out, out + 1e-7)
    l.backward()
    adam.step()
This reaches 100% GPU utilization and peak temperatures. It DOES NOT crash.
[2] Ran gpu-burn and memtest for over 12 hours; all passed. I take this to rule out power supply/RAM issues.
[3] However, running the official DETR/DINO DETR training code causes a lot of crashes.
DINO DETR can crash in as few as 40 training iterations.
[4] The machine works fine with cards from other generations, like Quadro and Pascal.
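Since [3] reboots the whole machine, buffered console output can be lost before it reaches the disk. One way to narrow down where the crash happens would be a breadcrumb log that is forced to disk before each stage of the training loop, optionally with a temperature/power reading from nvidia-smi. This is only a minimal sketch: `breadcrumb`, `gpu_telemetry` and `LOG_PATH` are made-up names for illustration and are not part of PyTorch or the DETR repositories.

```python
import os
import shutil
import subprocess
import time

# Hypothetical log location; any path on a local disk should work.
LOG_PATH = "crash_breadcrumbs.log"


def gpu_telemetry():
    """Return 'temp_C, power_W, sm_clock_MHz' for the first GPU, or 'n/a'."""
    if shutil.which("nvidia-smi") is None:
        return "n/a"
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=temperature.gpu,power.draw,clocks.sm",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return "n/a"
    # nvidia-smi prints one line per GPU; keep only the first one.
    lines = result.stdout.strip().splitlines()
    return lines[0] if result.returncode == 0 and lines else "n/a"


def breadcrumb(stage, path=LOG_PATH, with_telemetry=False):
    """Append one line and fsync it so it survives a hard reboot."""
    extra = f" | {gpu_telemetry()}" if with_telemetry else ""
    with open(path, "a") as f:
        f.write(f"{time.time():.3f} {stage}{extra}\n")
        f.flush()
        os.fsync(f.fileno())


# Intended use inside a training loop (not executed here):
#   breadcrumb(f"iter {i}: before forward", with_telemetry=True)
#   out = m(x)
#   breadcrumb(f"iter {i}: before backward")
#   l.backward()
```

After a reboot, the last line in the file shows the last stage that was reached. Because CUDA kernels launch asynchronously, the breadcrumbs only line up with the actual GPU work if the run is started with CUDA_LAUNCH_BLOCKING=1 or if torch.cuda.synchronize() is called before each breadcrumb.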
Now, this has led me to believe that this is a software issue and not a hardware one, since simple benchmarking works.
But my expertise runs out before I can pin down which precise call reproduces this issue,
so I would be grateful if someone could please give some suggestions:
[1] Is it an issue with my machine?
[2] Is it an NVIDIA driver issue or a PyTorch issue?
I have tried both the latest dev (430) and stable (425) drivers. I also tried CUDA 11, 11.7, and 12. All have the same issue.
[3] Or can it be an issue with my GPU?
[4] Are other people also facing such troubles?
[5] I don't think it is a temperature issue, since most of the time the crashes happen even when the temperature is in range.
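For anyone trying to reproduce this, the exact driver/CUDA/PyTorch combination probably matters, so here is a small sketch that prints the versions PyTorch itself reports. `collect_env` below is a made-up helper name for illustration; it only uses standard PyTorch attributes and falls back gracefully if PyTorch is not installed.

```python
import platform
import sys


def collect_env():
    """Gather version info that is useful in a GPU crash report."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships a fuller report that can be pasted into a post as-is: python -m torch.utils.collect_env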
@ptrblck and everyone, I will be grateful for your guidance.
Can you think of anything that might be the cause?
Something (an API) somewhere is doing something it shouldn't, but I can't think of what.
Thanks,
Rajat