Multi GPU (2080 ti) training crashes PC

mstrfx · July 30, 2019, 9:35am

My build:
Asrock z390 extreme4
Intel 8700k
x2 2080 ti x2
Cooler Master v1200 Platinum

Ubuntu 18.04

Cuda 10.0
nccl 2.4.0-2
pytorch was installed according to guide on pytorch.org

So I’ve got something interesting: pc crashes right after I try running imagenet script for multi gpu from official pytorch repository. It doesn’t crash pc if I start training with apex mixed precision. Training on a single 2080 also didn’t cause reboot.

What didn’t work:

decreasing batch size
limiting power consumption of gpu’s via nvidia-smi
changing motherboard, cpu, power supply
changing 2080 ti vendor

For some reason everything worked after I switched both 2080 ti’s with 1080 ti’s. So it seems pytorch (or some nvidia software) isn’t fully compatible with multiple 2080 ti’s? Has anyone encountered this?

ptrblck · July 30, 2019, 9:41pm

Two 2080TIs should work, so I think it might be a hardware issue.
However, it seems you’ve already changed a lot of parts of your system.
Regarding point 3 and 4, it seems you completely rebuilt your system.

Does your code only crash using the ImageNet example or also a very small model, e.g. a single linear layer and DataParallel?

hay · November 13, 2019, 2:13am

I can corroborate. It happens with me too. torch distributed data parallel fails on 2080Ti with >1 gpu, however works well with titan-x, titan-xp or 1080Ti.

AljoSt · April 20, 2020, 9:55am

Same here, fails on two 2080Ti, works with two 1080Ti. In the script, I just load a resnet50 from torchvision.models and do inference on it. It does not crash however, if I run the same script twice at the same time, each using one GPU. So it does not seem to a power supply issue

AljoSt · April 20, 2020, 1:42pm

Okay, so now I have done some runs with the following code:

import os

from tqdm import tqdm
from torchvision.models import resnet50, resnet18
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch
import torch.nn as nn

n = 1000000

class RandomDs(Dataset):
    def __init__(self, ):
        pass

    def __len__(self):
        return n

    def __getitem__(self, index):
        return torch.rand(3, 512, 256)


if __name__ == '__main__':

    dataset = RandomDs()
    data_loader = DataLoader(dataset, batch_size=128, shuffle=False, num_workers=4, pin_memory=False)


    model = resnet50()

    # os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    device = torch.device('cuda:0')
    model = nn.DataParallel(model)
    
    model.to(device)

    with torch.no_grad():
        for batch in tqdm(data_loader):
            batch = batch.to(device)
            model(batch)

I changed the input shape to the network, batch size, number of workers, model type, torch version, and whether it runs with DataParallel or as two scripts (so two scripts, running at the same time, so that both GPUs are in use). These are the results:

input shape	batch_size	num_workers	model	parallel	torch version	crash	memory (GB each)
3, 256, 256	128	4	resnet50	yes	1.2.0	No	~2.3
3, 512, 256	128	4	resnet50	yes	1.2.0	Yes	~3.3
3, 256, 256	256	4	resnet50	yes	1.2.0	Yes	~3.3
3, 256, 256	256	4	resnet50	two scripts	1.2.0	No	~5.3 (single)
3, 256, 256	256	16	resnet50	yes	1.2.0	Yes	~3.3
3, 256, 256	256	4	resnet18	yes	1.2.0	Yes	~2.2
3, 512, 256	128	4	resnet50	yes	1.4.0	Yes	~3.3

It seems to be memory related, but given that I am using 2*2080Ti and I have 64GB ram, it’s not an OOM. Any ideas what else I can test?

I am using cuda 10.2 and tried pytroch version 1.2 and 1.4

ptrblck · April 20, 2020, 2:13pm

Does your machine restart or how does the crash look for you?
If that’s the case, was PyTorch (or other libs) working before or is it a new setup?
Did you run some stress tests on the devices?

AljoSt · April 20, 2020, 2:27pm

The machine shuts down immediately and then restarts. It is more or less a new system: It has been used for other tasks before, but not for ML. I’ve run GPU-burn for an hour with Double precision and tensor cores enabled and it seemed to work fine.

The only times when it doesn’t crash is when I am using a small input (shape/bs) or when I’m not using DataParallel. Could this be related to the communication between the GPUs?

ptrblck · April 20, 2020, 10:34pm

You could run e.g. nccl-test to check the communication.

AljoSt · April 21, 2020, 12:51pm

Seems fine:

I’m really out of ideas on what to test. The only way I can replicate the crashes or find any abnormalities is by using PyTorch DataParallel. GPU burn runs fine, nccl-test runs fine, utilizing both GPUs at the same time individually is fine. What else could I do in order to identify the cause of the problem?

pritamdamania87 · April 21, 2020, 7:51pm

Can you check system logs to see if there is something there mentioning why the machine restarted? Usually on Linux there might be some information in /var/log/messages.

Another option might be to post on Nvidia forums to check if there might be some system issue.

ptrblck · April 22, 2020, 7:11am

That’s another good idea.
@AljoSt try to grab for Xid and post the result here, please.

AljoSt · April 23, 2020, 10:43am

Could not find anything interesting in kernlog, syslog or dmesg. But what I did find was this github issue where people have a similar problem. GPU burn etc cannot make the system crash and PSU is powerful enough (on paper). They say, that PyTorch causes a short big power surge which can cause the PSU to fail, even if it’s big enough to support the system on heavy load in other circumstances (gpu burn etc).

I put the 2 2080Tis into a system with a 1600w PSU and they seemed to work fine. The PSU in the other system had 850w. Obv. I cannot guarantee that this is the cause/solution due to many variables being changed but as other people have had similar experiences, I will take it for now (and will post again here if there are new developments). Thanks for your help!

(and btw, since many people have the same problem which seems to originate in PyTorch, I think it would be good if the relevant developers could have a look into that)

ptrblck · April 24, 2020, 6:56am

Since a 2080Ti can take >300W, a 850W PSU might not be enough for the system.
I’m not sure, if artificially limiting the GPU is a valid use case.

AljoSt · April 24, 2020, 8:29am

Ah interesting, I didn’t know that, thank you. I have difficulties finding the max power draw (the highest I saw in some random blog post was around 330w). Is it vendor specific (so would it differ say between MSI 2080Ti and EVGA 2080TI?) or is there a “fixed” number?

ptrblck · April 24, 2020, 8:34am

It should be vendor specific, as some vendors might increase the clock speeds and add a better cooling solution to the device. You won’t see much difference, but it’s not strictly a single number.

AljoSt · April 24, 2020, 9:18am

Thank you for your help!

mstrfx · April 24, 2020, 10:36am

hi guys

Just wanted to let you know it is not related to only pytorch framework. Happens also with tensorflow.

It feels like modern power supplies can’t handle 300w + 300w simultaneously during training. I got double 2080 ti with another power supply (with cheap 2$ synchronizer) and everything works like a charm.

VincentWang · October 18, 2021, 12:04am

Hi mstrfx, may I ask which PSU did you buy? I am facing the same issue with a 1200 watt PSU.

VincentWang · October 23, 2021, 2:38am

It must be the PSU…

So I have 2 2080Ti GPU (from EVGA) and CORSAIR - HX Series 1200W ATX12V 2.4/EPS12V 2.92 80 Plus Platinum Modular Power Supply. It shuts down the PC sometimes when I run 2 gpu for training (separately or together) and it usually happens when one model reaches the epoch of that fold.

Now I switch to EVGA - GP Series SuperNOVA 1000W ATX 80 Plus Gold Fully Modular Power Supply. It works well without any PC shuts down. I made the switch because my friend has the same setting as mine and he used this PSU. It turns out that it is indeed the case but it is still surprised to see 1200W cannot work but 1000W can.

Not sure if it was a defect for the 1200W PSU though.

3co · December 18, 2023, 12:09am

Been trying to run 2 RTX 3090’s on a brand new Corsair 1500W PSU… with many sudden shutdowns. I’m thinking there’s no way I don’t have enough watts, right?

I was worried it could be a temperature issue, so was obsessively logging temps for all devices, and running fans at max speed, right up to shutdown. GPUs were a “cool” ~60 C – not enough to warrant causing the shutdown.

What folks say above about potentially a single PSU “not being able to handle both 300w + 300w during training” due to sudden bursts of power then makes a lot of sense.

Thanks to @AljoSt pointing to the relevant github issue, where at the end @lfcnassif (Luis Filipe Nassif) · GitHub says that his team did experiments, and found that what successfully worked was to limit clock speeds of both GPUs.

The trick was simply running:
sudo nvidia-smi -lgc 1395,1740

(where the -lgc sets min, max clock speeds)

It turns out that max clock speed is not much lower than the ~1900 MHz that the GPU would naturally run at, but basically I think we’re just preventing sudden synchronous large bursts of computational hunger…

And after training a run for a long time now, I’m posting “confirmation”, this works.