System reboot when training

I am training a DDP model, but the system reboots every time I run it.
I use a machine with two RTX 3090 GPUs and a 10900K CPU, running Ubuntu 18, CUDA 11.0, NVIDIA driver 455.38, PyTorch 1.7.1, and Python 3.8.
I also tested my code on another machine with two 2080 Ti GPUs, and it runs well there.


It is hard to help without more details, such as the code you are running.

@kaka_zhao Regarding the system reboots, this could be a system issue rather than a PyTorch issue. I would check the system logs, such as /var/log/messages (or /var/log/syslog on Ubuntu), to see why the system rebooted.
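
One way to follow up on this is to look at the tail of the log from the boot that ended in the reboot. A minimal sketch, assuming a systemd-based system with persistent journaling enabled (otherwise there is nothing to show for the previous boot); the function name is hypothetical:

```shell
# Hypothetical helper: show the last lines logged during the previous boot.
show_previous_boot_log() {
    if ! command -v journalctl >/dev/null 2>&1; then
        echo "journalctl not found; check /var/log/syslog or /var/log/messages instead"
        return 0
    fi
    # -b -1 selects the previous boot; -n 50 limits output to its last 50 lines.
    journalctl -b -1 -n 50 --no-pager 2>/dev/null \
        || echo "no journal for the previous boot (persistent journaling may be off)"
}

show_previous_boot_log
```

If the log simply stops with no shutdown messages, that would be consistent with an abrupt reset rather than a software-initiated reboot, though it does not prove the cause.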

I think this is a hardware issue. I use a 1200 W PSU for two RTX 3090s.
After I lowered the GPU power limit from 350 W to 250 W with nvidia-smi -i 0,1 -pl 250, the problem went away!
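
To put rough numbers on this, here is a back-of-the-envelope sketch of the nominal GPU power budget, using only the figures from this post. It ignores the CPU, the rest of the system, and any short spikes above the configured limit, so treat it as an illustration rather than a diagnosis. `gpu_budget` is a made-up helper, not part of any tool.

```shell
# Nominal combined GPU power limit in watts: number of GPUs times per-GPU limit.
gpu_budget() {
    echo $(( $1 * $2 ))
}

echo "default: $(gpu_budget 2 350) W"   # two GPUs at the 350 W default
echo "limited: $(gpu_budget 2 250) W"   # two GPUs after -pl 250
```

That is 700 W versus 500 W of nominal GPU load on the 1200 W PSU.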


Limiting GPU power works for me.

Detailed instructions:

# enable persistence mode
sudo nvidia-smi -pm 1

# limit power from 350 W to 250 W
sudo nvidia-smi -i 0,1,...,3 -pl 250
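
To confirm the new limit took effect, something like the following could be run afterwards. This is a sketch that assumes nvidia-smi is on the PATH and that the driver supports these query fields; the function name is hypothetical.

```shell
# Hypothetical helper: print index, name, current draw and configured limit per GPU.
show_gpu_power() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found"
        return 0
    fi
    nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv
}

show_gpu_power
```

As far as I know, a limit set with -pl does not survive a reboot, so it needs to be reapplied after each boot.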