Server with dual 3090 GPU crashes

danohev · January 27, 2021, 3:41pm

Hi all!

I have a server with 2 x RTX 3090 GPUs.
I installed NVIDIA driver 460, CUDA 11.1, and PyTorch nightly (1.8) on Ubuntu 20, then tried running deep learning benchmarks.

Everything runs fine if I use a single GPU.
But the moment I run both of them, the PC just shuts off.

I tried a stress test that loaded both GPUs to 100% utilization, and it ran fine without crashing.
I also tried limiting the power of the GPUs to 200W (using the ‘sudo nvidia-smi -pl 200’ command), then started the PyTorch training script, and it crashed again.

So I guess it isn’t a power supply issue (it’s a SilverStone 1500 watt power supply).
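
For reference, the same power cap can be applied per GPU from a script. This is only a minimal sketch, not the poster's actual setup: the helper names `power_limit_cmd` and `apply_power_limit` are hypothetical, and it assumes `nvidia-smi` is on the PATH and the script runs with root privileges. It uses the `-pl` flag from the command above together with `-i` to select a GPU by index.

```python
import subprocess


def power_limit_cmd(gpu_index, watts):
    """Build the nvidia-smi command that caps one GPU's power limit."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    # -i selects the GPU by index, -pl sets the power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_indices, watts):
    """Run the command for each GPU (needs root; raises if nvidia-smi fails)."""
    for idx in gpu_indices:
        subprocess.run(power_limit_cmd(idx, watts), check=True)


# Print the commands instead of running them, so this is safe to try anywhere.
for idx in (0, 1):
    print(" ".join(power_limit_cmd(idx, 200)))
```

Calling `apply_power_limit([0, 1], 200)` would apply the 200W cap to both cards. A power limit governs sustained draw, so it may not rule out very short spikes.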

Here are the code lines I use to run on both GPUs:

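import torch
import torch.nn as nn
from torchvision import models

# `device` and `dataloaders` are defined elsewhere in the training script (not shown)
model = models.resnet152(pretrained=False)
# Replace the first conv layer so the network accepts 1-channel input
model.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
# Replace the classifier head with a 2-class output
model.fc = nn.Linear(2048, 2)

# Split each batch across all visible GPUs when more than one is present
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

for input_images, labels in dataloaders['train']:
    # Enable CUDA: use GPUs for model computation
    input_images, labels = input_images.to(device), labels.to(device)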

I don’t know how to proceed with this…
Help needed.

thanks!

J_Johnson · January 29, 2021, 1:50pm

Have you tried with CUDA 11.0 and the corresponding drivers?

danohev · January 30, 2021, 6:47pm

No, just CUDA 11.1.
I was told this might still be a PSU issue.
Even though the stress test with 100% utilization on both GPUs passes, the GPUs and CPU might produce big power spikes during PyTorch training that the PSU can’t handle (1500W, rated ‘80 PLUS Silver’).
I got the recommendation to get a PSU with an ‘80 PLUS Platinum’ or ‘80 PLUS Titanium’ rating.

what do you think?
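
To make the power-spike argument concrete, here is a rough back-of-the-envelope sketch. Every number except the 1500W PSU rating is an assumption for illustration, not a measurement from this machine: 350W is the reference board power of an RTX 3090, the 300W for the rest of the system is a guess, and the spike multipliers are hypothetical. The function name `peak_draw_estimate` is also made up for this example.

```python
def peak_draw_estimate(n_gpus, gpu_rated_w, spike_factor, rest_of_system_w):
    """Instantaneous draw if every GPU spikes at the same moment."""
    return n_gpus * gpu_rated_w * spike_factor + rest_of_system_w


PSU_RATED_W = 1500       # from the post
GPU_RATED_W = 350        # assumption: reference RTX 3090 board power
REST_OF_SYSTEM_W = 300   # assumption: CPU, drives, fans

# Assumption: illustrative multipliers for short transients above rated power.
for spike_factor in (1.0, 1.5, 2.0):
    peak = peak_draw_estimate(2, GPU_RATED_W, spike_factor, REST_OF_SYSTEM_W)
    headroom = PSU_RATED_W - peak
    print(f"spike x{spike_factor}: peak {peak:.0f} W, headroom {headroom:.0f} W")
```

Under these assumed numbers the sustained load leaves plenty of headroom, while a simultaneous 2x transient on both cards would exceed the rating. That would be consistent with a steady stress test passing and a training run tripping the PSU's protection, but it is only an estimate.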

J_Johnson · January 31, 2021, 12:47am

Any PSU above 1200W usually requires a higher incoming voltage to reach its maximum stated wattage. What is the AC input voltage coming into the PSU? (This can vary by country.)

J_Johnson · January 31, 2021, 12:53am

By way of example, consider the CORSAIR AXi Series AX1500i Digital 1500W 80 PLUS TITANIUM power supply (listing on Newegg.com).

This PSU is rated 1500W, but if the incoming power supply is 100-115V, it will only provide 1300W.
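
The derating described above can be written as a small lookup. This is a sketch for the example PSU only: the 1300W figure for 100-115V input and the 1500W rating come from the post, while treating everything above 115V as full rating is an assumption, and the function name `usable_watts` is hypothetical.

```python
def usable_watts(input_voltage_v):
    """Usable output of the example 1500W PSU at a given AC input voltage."""
    if input_voltage_v < 100:
        raise ValueError("input voltage below the range discussed here")
    if input_voltage_v <= 115:
        return 1300  # derated figure quoted for 100-115V input
    # Assumption: the full rating is available above the low-line range.
    return 1500


for volts in (110, 220):
    print(f"{volts} V input: {usable_watts(volts)} W usable")
```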

danohev · January 31, 2021, 6:11am

We supply 220V AC voltage.

Ludwig_Friborg · June 11, 2021, 7:18am

Any follow-up on this? I have a similar situation with pretty much the same rig. The PSU power draw is about 600W when the rig crashes.
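
One way to log per-GPU power draw while a training job runs is to poll `nvidia-smi` with its CSV query options. The sketch below is an illustration under assumptions: the helper names `parse_power_csv` and `read_gpu_power` are hypothetical, the sample values are made up, and it assumes `nvidia-smi` is on the PATH and that every GPU reports a numeric `power.draw`. Polled readings are sampled at intervals, so they may miss very short spikes.

```python
import subprocess

# Query GPU index and current power draw, as bare CSV without header or units.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_power_csv(text):
    """Turn lines like '0, 312.45' into a {gpu_index: watts} dict."""
    readings = {}
    for line in text.strip().splitlines():
        index, watts = (field.strip() for field in line.split(","))
        readings[int(index)] = float(watts)
    return readings


def read_gpu_power():
    """Run the query once and return the parsed readings."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_power_csv(result.stdout)


# Made-up sample in the shape this query prints, so the parser can be tried
# without a GPU present.
sample = "0, 312.45\n1, 298.10\n"
print(parse_power_csv(sample))
```

Calling `read_gpu_power()` in a loop with a short sleep, and appending each reading to a file, would leave a record of the last values seen before a shutdown.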

Hiroyuki_Onishi · May 17, 2022, 2:36am

I’m running into a similar symptom.
Did you find a solution to this?

danohev · May 17, 2022, 7:50am

Yes, changing the PSU as suggested solved the problem.
I think it has an ‘80 PLUS Platinum’ or ‘80 PLUS Titanium’ rating.