Computer shuts down on model loading

I’ve been struggling to solve a hard-shutdown issue on my computer. When I run any application that calls PyTorch and CUDA, the machine shuts off right as the main LLM is about to load into VRAM. It is as if some overload protection is triggered and it just hard-crashes; the motherboard still has power (the network adapter stays lit, for example), but the machine needs a power pull and re-plug before it will boot again.

All hardware and software is up to date and fully stress-tested: 64 GB of ECC RAM, NVIDIA A5000 24 GB GPU.

For example, in ComfyUI the UNet, LoRA, and VAE all load fine. When it hits -----Start infer----- it may run for a few seconds, then crash. It happens so fast that nothing gets written to any of the logs I have tried to gather.

This only happens with PyTorch-related operations. I can run 36 hours of AEC-type renders and similar workloads on the machine without a single crash.

I had PyTorch 2.6 with CUDA 12.6, and now 2.7 with CUDA 12.8: same crash.

I have also tried checkpoints of all sizes, different float precisions, FB’s, models, etc., even ones that run in low VRAM down to 12 GB… still the same.

Last 3 lines of last crash:

[2025-04-29 07:41:35.734] Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support forfloat16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference.
[2025-04-29 07:41:35.735] Requested to load AutoencodingEngine
[2025-04-29 07:41:35.900] loaded completely 20147.426822662354 186.42292404174805 True

Any assistance would be highly appreciated.

This sounds like a system issue, and an underspecced PSU can often cause sudden reboots.


I’ve thought of that as well. The system has a Seasonic 700 W PSU; measured max GPU draw is 231 W, and the motherboard, CPU, RAM, and M.2 drives draw another 250 W, for a measured maximum of 481 W under extended full-load tests. That leaves 219 W available for spikes of any kind, and with a Titanium-rated PSU at a 94% efficiency rating, you would think there is enough headroom.

That still doesn’t explain why it only happens with this one AI-related task. The machine has otherwise been under a few years of heavy use with no issues, including days at a time of sustained maximum GPU and VRAM usage with native Windows applications.

Could it be something in PyTorch where, when a model is loaded, it does not recognize the limits of the GPU and tries to dump the whole thing into VRAM, overloading the GPU? Shouldn’t caching and the CPU take over to cover this overload if the GPU VRAM is maxed out?
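(For context: PyTorch will not spill tensors from VRAM to system RAM on its own; offloading is opt-in at the framework level. A minimal sketch of explicit CPU offload, assuming a Hugging Face diffusers pipeline and a placeholder model id; ComfyUI manages its own low-VRAM modes internally and differently.)

import torch
from diffusers import DiffusionPipeline

# Placeholder model id; substitute whatever checkpoint is actually being loaded.
pipe = DiffusionPipeline.from_pretrained(
    "some-org/some-diffusion-model", torch_dtype=torch.float16
)

# Offload is explicit: weights stay in system RAM and each sub-model is moved
# onto the GPU only while it is running, then moved back.
pipe.enable_model_cpu_offload()

image = pipe("a test prompt").images[0]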

I have just requested an RMA on my current PSU, moving to a newer model with 300 W more, and will see if that solves the issue.

I don’t understand what “overloading the GPU” means here. PyTorch is known to be a performant framework that can push hardware utilization quite a bit. Other topics on this discussion board describe similar situations where the PSU wasn’t able to sustain the load.

Assuming “overload” means PyTorch tries to max out the GPU’s global memory, a plain OOM error will be raised and the system won’t crash:

import torch

x = torch.randn(1024**5, device="cuda")
# OutOfMemoryError: CUDA out of memory. Tried to allocate 4194304.00 GiB. ...
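If it helps to rule out memory pressure, you could also print how much device memory is actually free right before the model loads; a minimal sketch:

import torch

free, total = torch.cuda.mem_get_info()  # bytes, for the current device
print(f"free:  {free / 1024**3:.2f} GiB")
print(f"total: {total / 1024**3:.2f} GiB")
print(f"allocated by PyTorch: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")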

@ptrblck Seasonic came through and RMA’d the 700 W fanless PSU, which was manufactured well before the date associated with a known overly sensitive anti-surge issue. It was replaced with a new Vertex PX-1000 with a 2023 manufacturing date that is a little “looser” on the anti-surge protection. After several days of burn-in and rigorous AI stress testing, I have not had a single crash.

You can mark this one as SOLVED. Thank you.

Good to hear and thanks for the update!