Unexpected Reboot

My system reboots unexpectedly while I am running a PyTorch model.

The model just calculates embeddings for 200 images every second. I load the images with cv2 and process them with either a ViT or a VGG16 model.

Both models load onto the GPU correctly, and the code runs for a varying number of iterations before the reboot.

The system hardware consists of the following components:

- 1080 Ti
- 2080 Ti
- Ryzen 7
- 850 W PSU

I tested two different models, thinking that the size or computational complexity of the model might be causing the reboot. The tests ran with a ViT and a VGG16. Unfortunately, the shutdown occurs with both models, and it happens at a random point while running the sequence of images.

Software environment:

- Ubuntu 22.04
- NVIDIA driver: 535 (CUDA 12.2)
- nvcc --version: CUDA 12.2
- PyTorch version: 2.2

I am monitoring the GPUs with nvidia-smi, and neither card's memory usage rises above 4 GB. I am also watching the temperatures, and both cards stay at or below 47°C during the tests.
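Memory and temperature are covered by the checks above, but power draw is not. A minimal sketch of how all three could be logged from Python through nvidia-smi's `--query-gpu` option is shown here. The function names `parse_gpu_stats` and `query_gpu_stats` are made up for this example, and some cards may report `[N/A]` for `power.draw`, which the parser maps to `None`.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
FIELDS = ["index", "power.draw", "temperature.gpu", "memory.used"]


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip malformed lines
        row = {}
        for name, value in zip(FIELDS, values):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # e.g. "[N/A]" when a field is unsupported
        if row["index"] is not None:
            row["index"] = int(row["index"])
        rows.append(row)
    return rows


def query_gpu_stats():
    """Run nvidia-smi once and return parsed readings (watts, °C, MiB)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)
```

Calling `query_gpu_stats()` about once per second and appending each reading to a file that is flushed right away would leave a record that survives the reboot. Polling this way is coarse, so very short power spikes may not show up in the log.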

Any suggestions on how I can debug or fix this unexpected reboot?

Thanks in advance for all responses.

This sounds like either a power issue, such as a loose GPU-to-PSU cable or a PSU that is not big enough (although I think yours should be enough), or a CPU temperature issue. It sounds like you built your own PC, in which case, are you sure your CPU fan is working? Otherwise your CPU will get really hot and shut down.
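To make the "should be enough" estimate concrete, here is a rough power-budget sketch. Every wattage in it is an assumption for illustration: the GPU figures are nominal reference board power as I recall it (check your specific cards), the CPU and rest-of-system numbers are guesses because the exact parts are not given, and the transient factor is a hypothetical multiplier for short GPU power spikes.

```python
# All wattages below are assumptions for illustration, not measurements.
GPU_BOARD_POWER_W = {"1080 Ti": 250, "2080 Ti": 250}  # nominal reference board power
CPU_W = 105             # hypothetical: the exact Ryzen 7 model is not given
REST_OF_SYSTEM_W = 75   # hypothetical: motherboard, RAM, drives, fans
PSU_W = 850             # from the hardware list in the question
TRANSIENT_FACTOR = 1.5  # hypothetical multiplier for short GPU power spikes


def budget(gpus, transient_factor=1.0):
    """Estimated total draw in watts for the given set of GPUs."""
    gpu_total = sum(GPU_BOARD_POWER_W[g] for g in gpus) * transient_factor
    return gpu_total + CPU_W + REST_OF_SYSTEM_W


both = ["1080 Ti", "2080 Ti"]
sustained = budget(both)                   # 680 W, under the 850 W rating
spike = budget(both, TRANSIENT_FACTOR)     # 930 W, over the 850 W rating
single_spike = budget(["2080 Ti"], TRANSIENT_FACTOR)  # 555 W

for label, watts in [("sustained, both GPUs", sustained),
                     ("spike, both GPUs", spike),
                     ("spike, one GPU", single_spike)]:
    status = "within" if watts <= PSU_W else "over"
    print(f"{label}: {watts:.0f} W ({status} PSU rating)")
```

Under these assumed numbers the sustained load fits comfortably, while a simultaneous spike on both cards would not. That is one way an 850 W unit can look sufficient on paper and still trip its protection.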

I suspect you can reproduce the shutdown using software like FurMark and a CPU stress test you can find online.
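If a graphical tool is awkward on the Ubuntu box, a similar GPU load can be approximated from PyTorch itself. The sketch below is not a replacement for a proper stress tool. It loads each GPU alone and then all of them together with large matrix multiplies. The function names, the matrix size, and the 60-second duration are arbitrary choices for this example.

```python
import time


def plan_stages(n_gpus):
    """Return device groups to test: each GPU alone, then all together."""
    stages = [[i] for i in range(n_gpus)]
    if n_gpus > 1:
        stages.append(list(range(n_gpus)))
    return stages


def stress(devices, seconds=60, size=8192):
    """Keep the given CUDA devices busy with large matrix multiplies."""
    import torch  # imported here so plan_stages works without PyTorch

    mats = [torch.randn(size, size, device=f"cuda:{d}") for d in devices]
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        for m in mats:
            m @ m  # kernels are queued asynchronously on each device
        for d in devices:
            torch.cuda.synchronize(d)  # wait so the work queue stays bounded


def run_all(seconds=60):
    """Run every stage in order, printing which GPUs are under load."""
    import torch

    for devices in plan_stages(torch.cuda.device_count()):
        print("stressing GPUs", devices, flush=True)
        stress(devices, seconds)
```

Calling `run_all()` starts the stages. If the machine survives every single-GPU stage but reboots during the combined stage, that would point toward total power draw rather than one faulty card.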

Thanks. I ran the tests with one of the GPUs removed and it worked fine. It seems to be a PSU issue.