My system reboots unexpectedly while running a PyTorch model.
The model just computes an embedding for 200 images every second. I load the images using cv2 and process them with either a ViT or a VGG16 model.
Both models load onto the GPU correctly and the code runs for several iterations before the reboot.
The system hardware consists of the following components:
- GTX 1080 Ti
- RTX 2080 Ti
- Ryzen 7
- 850 W PSU
I tested two different models, thinking that the size or computational complexity of the model might be causing the reboot. The tests ran with a ViT and a VGG16. Unfortunately, the reboot occurs with both models, and it happens at a random point while processing the sequence of images.
OS: Ubuntu 22.04
NVIDIA drivers installed:
CUDA 12.2, driver 535
nvcc --version: CUDA 12.2
PyTorch version: 2.2
I am monitoring the GPUs with nvidia-smi, and memory usage on either card never rises above 4 GB. I am also watching the temperatures, and both cards peak at 47°C during the tests.
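Since the reboot wipes whatever nvidia-smi was showing on screen, the readings could instead be logged to disk so the last sample before the reboot survives. Below is a minimal sketch of that idea, not a tested diagnostic tool. It assumes nvidia-smi is on the PATH; the file name gpu_power.log, the 0.5 s interval and the helper names parse_row, sample and log_forever are placeholders chosen for illustration. It also records power.draw, which is not something measured above, on the assumption that power draw may be worth comparing against the 850 W PSU.

```python
import os
import subprocess
import time

# Query fields accepted by `nvidia-smi --query-gpu`
# (the full list is printed by `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "index", "name", "power.draw", "temperature.gpu", "memory.used"]


def parse_row(line):
    """Split one CSV row printed by nvidia-smi into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def sample():
    """Return one dict per GPU with the current readings."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_power.log", interval_s=0.5):
    """Append readings to `path` until interrupted (hypothetical helper)."""
    with open(path, "a") as f:
        while True:
            for row in sample():
                f.write(",".join(row[k] for k in FIELDS) + "\n")
            # Flush and fsync after every sample so the data is on disk
            # even if the machine resets a moment later.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)


# To start logging, run this in a separate terminal before the model:
# log_forever()
```

After a reboot, the last few lines of gpu_power.log would show the power draw, temperature and memory use of each card shortly before the reset. Very short power spikes can fall between samples, so a clean log would not rule out a power problem.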
Any suggestions on how I can debug or fix this unexpected reboot?
Thanks in advance for all responses.