Multi GPU (2080 ti) training crashes PC

My build:
Asrock z390 extreme4
Intel 8700k
x2 2080 ti x2
Cooler Master v1200 Platinum

Ubuntu 18.04

Cuda 10.0
nccl 2.4.0-2
pytorch was installed according to guide on pytorch.org

So I’ve got something interesting: pc crashes right after I try running imagenet script for multi gpu from official pytorch repository. It doesn’t crash pc if I start training with apex mixed precision. Training on a single 2080 also didn’t cause reboot.

What didn’t work:

  1. decreasing batch size
  2. limiting power consumption of gpu’s via nvidia-smi
  3. changing motherboard, cpu, power supply
  4. changing 2080 ti vendor

For some reason everything worked after I switched both 2080 ti’s with 1080 ti’s. So it seems pytorch (or some nvidia software) isn’t fully compatible with multiple 2080 ti’s? Has anyone encountered this?

Two 2080TIs should work, so I think it might be a hardware issue.
However, it seems you’ve already changed a lot of parts of your system.
Regarding point 3 and 4, it seems you completely rebuilt your system.

Does your code only crash using the ImageNet example or also a very small model, e.g. a single linear layer and DataParallel?

I can corroborate. It happens with me too. torch distributed data parallel fails on 2080Ti with >1 gpu, however works well with titan-x, titan-xp or 1080Ti.

Same here, fails on two 2080Ti, works with two 1080Ti. In the script, I just load a resnet50 from torchvision.models and do inference on it. It does not crash however, if I run the same script twice at the same time, each using one GPU. So it does not seem to a power supply issue

Okay, so now I have done some runs with the following code:

import os

from tqdm import tqdm
from torchvision.models import resnet50, resnet18
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch
import torch.nn as nn

n = 1000000

class RandomDs(Dataset):
    def __init__(self, ):
        pass

    def __len__(self):
        return n

    def __getitem__(self, index):
        return torch.rand(3, 512, 256)


if __name__ == '__main__':

    dataset = RandomDs()
    data_loader = DataLoader(dataset, batch_size=128, shuffle=False, num_workers=4, pin_memory=False)


    model = resnet50()

    # os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    device = torch.device('cuda:0')
    model = nn.DataParallel(model)
    
    model.to(device)

    with torch.no_grad():
        for batch in tqdm(data_loader):
            batch = batch.to(device)
            model(batch)

I changed the input shape to the network, batch size, number of workers, model type, torch version, and whether it runs with DataParallel or as two scripts (so two scripts, running at the same time, so that both GPUs are in use). These are the results:

input shape batch_size num_workers model parallel torch version crash memory (GB each)
3, 256, 256 128 4 resnet50 yes 1.2.0 No ~2.3
3, 512, 256 128 4 resnet50 yes 1.2.0 Yes ~3.3
3, 256, 256 256 4 resnet50 yes 1.2.0 Yes ~3.3
3, 256, 256 256 4 resnet50 two scripts 1.2.0 No ~5.3 (single)
3, 256, 256 256 16 resnet50 yes 1.2.0 Yes ~3.3
3, 256, 256 256 4 resnet18 yes 1.2.0 Yes ~2.2
3, 512, 256 128 4 resnet50 yes 1.4.0 Yes ~3.3

It seems to be memory related, but given that I am using 2*2080Ti and I have 64GB ram, it’s not an OOM. Any ideas what else I can test?

I am using cuda 10.2 and tried pytroch version 1.2 and 1.4

Does your machine restart or how does the crash look for you?
If that’s the case, was PyTorch (or other libs) working before or is it a new setup?
Did you run some stress tests on the devices?

The machine shuts down immediately and then restarts. It is more or less a new system: It has been used for other tasks before, but not for ML. I’ve run GPU-burn for an hour with Double precision and tensor cores enabled and it seemed to work fine.

The only times when it doesn’t crash is when I am using a small input (shape/bs) or when I’m not using DataParallel. Could this be related to the communication between the GPUs?

You could run e.g. nccl-test to check the communication.

Seems fine:

I’m really out of ideas on what to test. The only way I can replicate the crashes or find any abnormalities is by using PyTorch DataParallel. GPU burn runs fine, nccl-test runs fine, utilizing both GPUs at the same time individually is fine. What else could I do in order to identify the cause of the problem?

Can you check system logs to see if there is something there mentioning why the machine restarted? Usually on Linux there might be some information in /var/log/messages.

Another option might be to post on Nvidia forums to check if there might be some system issue.

That’s another good idea.
@AljoSt try to grab for Xid and post the result here, please.

Could not find anything interesting in kernlog, syslog or dmesg. But what I did find was this github issue where people have a similar problem. GPU burn etc cannot make the system crash and PSU is powerful enough (on paper). They say, that PyTorch causes a short big power surge which can cause the PSU to fail, even if it’s big enough to support the system on heavy load in other circumstances (gpu burn etc).

I put the 2 2080Tis into a system with a 1600w PSU and they seemed to work fine. The PSU in the other system had 850w. Obv. I cannot guarantee that this is the cause/solution due to many variables being changed but as other people have had similar experiences, I will take it for now (and will post again here if there are new developments). Thanks for your help!

(and btw, since many people have the same problem which seems to originate in PyTorch, I think it would be good if the relevant developers could have a look into that)

Since a 2080Ti can take >300W, a 850W PSU might not be enough for the system.
I’m not sure, if artificially limiting the GPU is a valid use case.

Ah interesting, I didn’t know that, thank you. I have difficulties finding the max power draw (the highest I saw in some random blog post was around 330w). Is it vendor specific (so would it differ say between MSI 2080Ti and EVGA 2080TI?) or is there a “fixed” number?

It should be vendor specific, as some vendors might increase the clock speeds and add a better cooling solution to the device. You won’t see much difference, but it’s not strictly a single number.

Thank you for your help!

hi guys :smiley:

Just wanted to let you know it is not related to only pytorch framework. Happens also with tensorflow.

It feels like modern power supplies can’t handle 300w + 300w simultaneously during training. I got double 2080 ti with another power supply (with cheap 2$ synchronizer) and everything works like a charm.