I am facing a strange problem. When I run my code on two GPUs, my machine reboots after some epochs. If I run the same code on only one GPU, everything works fine.
I have followed this old thread, "nn.DataParallel(model).cuda() stuck", but it did not help.
Is your power supply powerful enough to feed them?
If it’s at its limit, it’s only a matter of time before you hit a power peak. When that happens, the power supply simply cuts the current, and that causes a “reboot”.
Oh, and keep in mind that the official GPU power consumption figure is an average rather than a maximum. You may see peaks 1.6 times higher than the specification.
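To make that concrete, here is a rough back-of-the-envelope check in Python. It is only a sketch: the function names are made up for illustration, the 1.6 peak factor is the figure mentioned above, and the wattages (250 W per GPU, 150 W for the rest of the system, a 750 W power supply) are placeholder assumptions, so plug in the numbers for your own hardware.

```python
def peak_power_estimate(gpu_rated_watts, n_gpus, rest_of_system_watts, peak_factor=1.6):
    """Worst-case draw in watts if every GPU spikes at the same moment.

    peak_factor=1.6 is the rule of thumb from this thread, not a measured value.
    """
    return n_gpus * gpu_rated_watts * peak_factor + rest_of_system_watts


def psu_has_headroom(psu_watts, gpu_rated_watts, n_gpus, rest_of_system_watts, peak_factor=1.6):
    """True if the estimated worst-case draw fits within the PSU rating."""
    peak = peak_power_estimate(gpu_rated_watts, n_gpus, rest_of_system_watts, peak_factor)
    return peak <= psu_watts


# Placeholder numbers, purely illustrative.
one_gpu_ok = psu_has_headroom(750, 250, 1, 150)
two_gpus_ok = psu_has_headroom(750, 250, 2, 150)
print("1 GPU fits:", one_gpu_ok)
print("2 GPUs fit:", two_gpus_ok)
```

With those placeholder numbers, one GPU stays within the power supply rating while two GPUs exceed it, which would match the symptom of the machine only rebooting in the two-GPU case.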