Server with dual 3090 GPU crashes

Hi all!

I have a server with 2 x RTX 3090 GPUs.
I installed NVIDIA driver 460, CUDA 11.1, and PyTorch nightly (1.8) on Ubuntu 20 and tried running deep learning benchmarks.

The problem is that everything runs fine if I use a single GPU.
But the moment I run on both of them, the PC just shuts off.

  1. I tried a stress test that loaded both GPUs to 100% utilization, and it ran fine without crashing.
  2. I tried limiting the power of the GPUs to 200W (using the ‘sudo nvidia-smi -pl 200’ command) and started the PyTorch training script, and it crashed again.

So I guess it isn’t a power supply issue (it’s a SilverStone 1500 watt power supply).
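
For anyone who wants to check what the cards report while a job is running, here is a minimal sketch that reads the per-GPU power draw and power limit through nvidia-smi's CSV query output. The function names are my own, and it assumes nvidia-smi is on the PATH and that the driver reports numeric values for these fields. Polling like this is coarse, so it will likely miss very short transient spikes and should only be treated as a rough sanity check.

```python
import subprocess

# nvidia-smi query for GPU index, current power draw and configured power limit (watts)
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_power_csv(text):
    """Parse CSV rows like '0, 198.52, 200.00' into (index, draw_w, limit_w) tuples."""
    rows = []
    for line in text.strip().splitlines():
        index, draw, limit = [field.strip() for field in line.split(",")]
        # Assumes numeric fields; a GPU that reports '[N/A]' would raise ValueError here.
        rows.append((int(index), float(draw), float(limit)))
    return rows


def read_gpu_power():
    """Run nvidia-smi once and return the parsed readings for every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_power_csv(result.stdout)
```

Calling read_gpu_power() in a loop with a short sleep, in a second terminal while the training script runs, shows whether the 200W limit is actually in effect on both cards.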

Here are the lines of code I use to run on both GPUs:

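import torch
import torch.nn as nn
from torchvision import models

# Assumed definition; the original post does not show how `device` is created.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ResNet-152 adapted to single-channel input and two output classes
model = models.resnet152(pretrained=False)
model.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
model.fc = nn.Linear(2048, 2)

# Replicate the model across all visible GPUs
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

# `dataloaders` is defined elsewhere in the training script
for input_images, labels in dataloaders['train']:
    # Enable CUDA: use GPUs for model computation
    input_images, labels = input_images.to(device), labels.to(device)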

I don’t know how to proceed with this…
Any help is appreciated.

thanks!

Have you tried with CUDA 11.0 and the corresponding drivers?

No, just CUDA 11.1.
I was told this might still be a PSU issue.
Even though the stress test with 100% utilization on both GPUs passes, when running the PyTorch training the GPUs and CPU might produce big power spikes that the PSU can’t handle (1500W, rated ‘80 PLUS Silver’).
I got a recommendation to get a PSU with an ‘80 PLUS Platinum’ or ‘80 PLUS Titanium’ rating.

What do you think?

Any PSU above 1200W usually requires that the incoming voltage be of a higher rating to achieve the max stated watt rating of the PSU. What is the AC input voltage rating coming into the PSU? (This can vary by country.)

By way of example, take the CORSAIR AXi Series AX1500i Digital 1500W 80 PLUS TITANIUM power supply (Newegg.com listing).

This PSU is rated 1500W, but if the incoming power supply is 100-115V, it will only provide 1300W.
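
To put rough numbers on this, here is a back-of-the-envelope sketch of the headroom calculation. The function name, the CPU wattage, and the spike factor are placeholder assumptions of mine, not measured values; 350W is the stock board power of an RTX 3090, and 1300W is the derated capacity mentioned above. Treat the result as a way to reason about margins, not as a prediction.

```python
def psu_headroom_watts(psu_capacity_w, gpu_limits_w, cpu_w, spike_factor):
    """Return PSU capacity minus an estimated worst-case draw.

    spike_factor is a hypothetical multiplier for short GPU transients above
    the configured power limit. A negative result means the estimate exceeds
    what the PSU can deliver.
    """
    gpu_peak_w = sum(limit * spike_factor for limit in gpu_limits_w)
    return psu_capacity_w - (gpu_peak_w + cpu_w)


# Two RTX 3090s at the stock 350W limit, an assumed 250W CPU,
# and an assumed 2x transient factor on a PSU derated to 1300W.
headroom = psu_headroom_watts(1300, [350, 350], cpu_w=250, spike_factor=2.0)
print(f"Estimated headroom: {headroom:.0f} W")
```

With these assumed inputs the estimate comes out negative, which is the kind of situation where a sustained-load stress test could pass while a workload with sharp spikes still trips the PSU's protection.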

We supply 220V AC voltage.

Any follow-up on this? I have a similar situation with pretty much the same rig. The PSU power draw is about 600W when the rig crashes.

I’m running into a similar symptom.
Did you find a solution for this?

Yes, changing the PSU as suggested solved the problem.
I think the new one has an ‘80 PLUS Platinum’ or ‘80 PLUS Titanium’ rating.
