Multi-GPU (2080 Ti) training crashes the PC

My build:
ASRock Z390 Extreme4
Intel 8700K
2x 2080 Ti
Cooler Master V1200 Platinum

Ubuntu 18.04

CUDA 10.0
NCCL 2.4.0-2
PyTorch was installed according to the guide on pytorch.org

So I’ve got something interesting: the PC crashes right after I try running the multi-GPU ImageNet script from the official PyTorch examples repository. It doesn’t crash if I start training with Apex mixed precision, and training on a single 2080 Ti also didn’t cause a reboot.
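For context, the Apex path that runs without crashing looks roughly like this (a minimal sketch with an assumed `O1` opt level and toy model; the actual run used the ImageNet example’s own flags):

```python
# Sketch of mixed-precision training via NVIDIA Apex (toy model, assumed settings).
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Wrap model and optimizer for mixed precision (O1 = conservative mixed precision).
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randn(32, 10, device='cuda')
loss = model(x).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```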

What didn’t work:

  1. decreasing the batch size
  2. limiting the GPUs’ power consumption via nvidia-smi (see the sketch after this list)
  3. changing the motherboard, CPU, and power supply
  4. changing the 2080 Ti vendor
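For step 2, this is roughly how the power cap was lowered per GPU (a sketch; the 200 W value is an assumed example, and `nvidia-smi -pl` typically needs root):

```python
# Sketch: lower the power limit of both GPUs by calling nvidia-smi from Python.
import subprocess

for gpu_id in (0, 1):
    subprocess.run(
        ['nvidia-smi', '-i', str(gpu_id), '-pl', '200'],  # cap GPU gpu_id at 200 W (assumed value)
        check=True,
    )
```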

For some reason everything worked after I swapped both 2080 Tis for 1080 Tis. So it seems PyTorch (or some NVIDIA software) isn’t fully compatible with multiple 2080 Tis? Has anyone encountered this?

Two 2080 Tis should work, so I think it might be a hardware issue.
However, it seems you’ve already changed a lot of parts in your system.
Regarding points 3 and 4, it sounds like you completely rebuilt the system.

Does your code only crash with the ImageNet example, or also with a very small model, e.g. a single linear layer wrapped in DataParallel?
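Something along these lines would be enough to check (a minimal sketch; the layer and batch sizes are arbitrary assumptions):

```python
# Minimal DataParallel sanity test: one linear layer split across all visible GPUs.
import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()
model = nn.DataParallel(model)          # uses all visible GPUs by default

x = torch.randn(64, 10, device='cuda')  # batch is scattered across the GPUs
out = model(x)
out.mean().backward()
print('forward/backward finished on', torch.cuda.device_count(), 'GPUs')
```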

I can corroborate, it happens to me too: torch DistributedDataParallel fails on 2080 Tis with more than one GPU, but works fine with a Titan X, Titan Xp, or 1080 Ti.
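For reference, roughly this kind of minimal DDP run is enough to trigger it on my machine (a sketch; the model size, batch size, and port are arbitrary assumptions):

```python
# Minimal DistributedDataParallel repro: one process per GPU via mp.spawn, NCCL backend.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'   # assumed free port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).cuda(rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    x = torch.randn(32, 10, device=rank)
    model(x).mean().backward()            # one forward/backward step per process
    print(f'rank {rank}: step finished')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```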