Asrock z390 extreme4
x2 2080 ti x2
Cooler Master v1200 Platinum
pytorch was installed according to guide on pytorch.org
So I’ve got something interesting: pc crashes right after I try running imagenet script for multi gpu from official pytorch repository. It doesn’t crash pc if I start training with apex mixed precision. Training on a single 2080 also didn’t cause reboot.
What didn’t work:
- decreasing batch size
- limiting power consumption of gpu’s via nvidia-smi
- changing motherboard, cpu, power supply
- changing 2080 ti vendor
For some reason everything worked after I switched both 2080 ti’s with 1080 ti’s. So it seems pytorch (or some nvidia software) isn’t fully compatible with multiple 2080 ti’s? Has anyone encountered this?
Two 2080TIs should work, so I think it might be a hardware issue.
However, it seems you’ve already changed a lot of parts of your system.
Regarding point 3 and 4, it seems you completely rebuilt your system.
Does your code only crash using the ImageNet example or also a very small model, e.g. a single linear layer and
I can corroborate. It happens with me too. torch distributed data parallel fails on 2080Ti with >1 gpu, however works well with titan-x, titan-xp or 1080Ti.
Same here, fails on two 2080Ti, works with two 1080Ti. In the script, I just load a
torchvision.models and do inference on it. It does not crash however, if I run the same script twice at the same time, each using one GPU. So it does not seem to a power supply issue
Okay, so now I have done some runs with the following code:
from tqdm import tqdm
from torchvision.models import resnet50, resnet18
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch.nn as nn
n = 1000000
def __init__(self, ):
def __getitem__(self, index):
return torch.rand(3, 512, 256)
if __name__ == '__main__':
dataset = RandomDs()
data_loader = DataLoader(dataset, batch_size=128, shuffle=False, num_workers=4, pin_memory=False)
model = resnet50()
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cuda:0')
model = nn.DataParallel(model)
for batch in tqdm(data_loader):
batch = batch.to(device)
I changed the
input shape to the network,
number of workers,
torch version, and whether it runs with
DataParallel or as two scripts (so two scripts, running at the same time, so that both GPUs are in use). These are the results:
||memory (GB each)
|3, 256, 256
|3, 512, 256
|3, 256, 256
|3, 256, 256
|3, 256, 256
|3, 256, 256
|3, 512, 256
It seems to be memory related, but given that I am using 2*2080Ti and I have 64GB ram, it’s not an OOM. Any ideas what else I can test?
I am using cuda 10.2 and tried pytroch version 1.2 and 1.4
Does your machine restart or how does the crash look for you?
If that’s the case, was PyTorch (or other libs) working before or is it a new setup?
Did you run some stress tests on the devices?
The machine shuts down immediately and then restarts. It is more or less a new system: It has been used for other tasks before, but not for ML. I’ve run GPU-burn for an hour with Double precision and tensor cores enabled and it seemed to work fine.
The only times when it doesn’t crash is when I am using a small input (shape/bs) or when I’m not using
DataParallel. Could this be related to the communication between the GPUs?
You could run e.g. nccl-test to check the communication.
I’m really out of ideas on what to test. The only way I can replicate the crashes or find any abnormalities is by using PyTorch DataParallel. GPU burn runs fine, nccl-test runs fine, utilizing both GPUs at the same time individually is fine. What else could I do in order to identify the cause of the problem?
Can you check system logs to see if there is something there mentioning why the machine restarted? Usually on Linux there might be some information in
Another option might be to post on Nvidia forums to check if there might be some system issue.
That’s another good idea.
@AljoSt try to grab for
Xid and post the result here, please.
Could not find anything interesting in kernlog, syslog or dmesg. But what I did find was this github issue where people have a similar problem. GPU burn etc cannot make the system crash and PSU is powerful enough (on paper). They say, that PyTorch causes a short big power surge which can cause the PSU to fail, even if it’s big enough to support the system on heavy load in other circumstances (gpu burn etc).
I put the 2 2080Tis into a system with a 1600w PSU and they seemed to work fine. The PSU in the other system had 850w. Obv. I cannot guarantee that this is the cause/solution due to many variables being changed but as other people have had similar experiences, I will take it for now (and will post again here if there are new developments). Thanks for your help!
(and btw, since many people have the same problem which seems to originate in PyTorch, I think it would be good if the relevant developers could have a look into that)
Since a 2080Ti can take >300W, a 850W PSU might not be enough for the system.
I’m not sure, if artificially limiting the GPU is a valid use case.
Ah interesting, I didn’t know that, thank you. I have difficulties finding the max power draw (the highest I saw in some random blog post was around 330w). Is it vendor specific (so would it differ say between MSI 2080Ti and EVGA 2080TI?) or is there a “fixed” number?
It should be vendor specific, as some vendors might increase the clock speeds and add a better cooling solution to the device. You won’t see much difference, but it’s not strictly a single number.
Just wanted to let you know it is not related to only pytorch framework. Happens also with tensorflow.
It feels like modern power supplies can’t handle 300w + 300w simultaneously during training. I got double 2080 ti with another power supply (with cheap 2$ synchronizer) and everything works like a charm.
Hi mstrfx, may I ask which PSU did you buy? I am facing the same issue with a 1200 watt PSU.
It must be the PSU…
So I have 2 2080Ti GPU (from EVGA) and CORSAIR - HX Series 1200W ATX12V 2.4/EPS12V 2.92 80 Plus Platinum Modular Power Supply. It shuts down the PC sometimes when I run 2 gpu for training (separately or together) and it usually happens when one model reaches the epoch of that fold.
Now I switch to EVGA - GP Series SuperNOVA 1000W ATX 80 Plus Gold Fully Modular Power Supply. It works well without any PC shuts down. I made the switch because my friend has the same setting as mine and he used this PSU. It turns out that it is indeed the case but it is still surprised to see 1200W cannot work but 1000W can.
Not sure if it was a defect for the 1200W PSU though.