I am training a model with DDP, but the system reboots every time.
My machine has two RTX 3090 GPUs and a 10900K CPU, running Ubuntu 18, CUDA 11.0, NVIDIA driver 455.38, PyTorch 1.7.1, and Python 3.8.
I also tested the same code on another machine with two 2080 Ti GPUs, and it runs fine there.
It is hard to help without more details, such as the code you are running.
@kaka_zhao Regarding the system reboots: this could be a system issue rather than a PyTorch issue. Try checking system logs such as /var/log/messages to see why the system rebooted.
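As a rough illustration of the log check suggested above, here is a minimal Python sketch that filters log lines for reboot-related keywords. The function names and the keyword list are assumptions made for this example, not anything from the thread, and the right log file depends on the distribution (Ubuntu commonly writes to /var/log/syslog rather than /var/log/messages). A reboot caused by a sudden loss of power may leave no log entry at all, so an empty result does not rule out hardware.

```python
from typing import Iterable, List, Sequence

# Hypothetical keyword list for this sketch; adjust it to your system.
DEFAULT_KEYWORDS = ("reboot", "shutdown", "panic", "NVRM", "Xid")


def find_suspicious_lines(
    lines: Iterable[str], keywords: Sequence[str] = DEFAULT_KEYWORDS
) -> List[str]:
    """Return the lines that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_log(path: str, keywords: Sequence[str] = DEFAULT_KEYWORDS) -> List[str]:
    """Read a log file (may require root) and return the matching lines."""
    with open(path, errors="replace") as handle:
        return find_suspicious_lines(handle, keywords)
```

Calling scan_log("/var/log/syslog") would return the matching lines for inspection; reading the file may need root privileges.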
I think this is a hardware issue. I use a 1200 W PSU for two RTX 3090s.
After I limited the GPU power from 350 W to 250 W with nvidia-smi -i 0,1 -pl 250, the problem went away!
Limiting GPU power works for me too.
Detailed instructions:
# enable persistence mode
sudo nvidia-smi -pm 1
# limit power from 350W to 250W
sudo nvidia-smi -i 0,1,...,3 -pl 250
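If you want to apply the same two commands from a script, for example before launching a DDP job, a minimal Python sketch could look like the following. The helper names are hypothetical and introduced only for this example; the nvidia-smi flags (-pm, -i, -pl) are the ones used in the posts above. It defaults to a dry run that only prints the commands, since the real calls need sudo and change the GPU settings.

```python
import subprocess
from typing import List, Sequence


def build_power_limit_commands(gpu_ids: Sequence[int], watts: int) -> List[List[str]]:
    """Build the two nvidia-smi invocations from the post as argument lists."""
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    if watts <= 0:
        raise ValueError("watts must be positive")
    id_list = ",".join(str(i) for i in gpu_ids)
    return [
        # Enable persistence mode.
        ["sudo", "nvidia-smi", "-pm", "1"],
        # Set the power limit (in watts) on the selected GPUs.
        ["sudo", "nvidia-smi", "-i", id_list, "-pl", str(watts)],
    ]


def apply_power_limit(
    gpu_ids: Sequence[int], watts: int, dry_run: bool = True
) -> List[List[str]]:
    """Print the commands (dry run) or execute them; returns the command list."""
    commands = build_power_limit_commands(gpu_ids, watts)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if nvidia-smi fails.
            subprocess.run(cmd, check=True)
    return commands
```

For the two-GPU machine in the original question, apply_power_limit([0, 1], 250) prints the commands, and passing dry_run=False runs them. The limit may not persist across reboots, so the script may need to run again after each boot.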