Best way to track potential reasons for crashes during training


I am training a rather large model with PyTorch (estimated forward/backward memory around 10 GB), and I am facing the problem that training sometimes crashes at around epoch ~100 of 800 (the computer shuts down). I am aware there can be multiple causes, like the GPU overheating, the CPU being thrashed, etc. So I was wondering which metrics would be best for tracking down the issue, and if possible, how to track these metrics efficiently, e.g. which libraries/commands to use.

For instance, tracking CPU usage is rather hard, since the CPU is mostly busy while the dataloader is loading images, so I am not sure where in the training loop to sample it. Looking forward to ideas!
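One way around the "where do I call it" problem is to not sample from inside the training loop at all, but to run a small background thread that logs system stats at a fixed interval and flushes after every line, so the last entries survive a hard crash. A minimal sketch (assumes Linux for `/proc/loadavg`, and `nvidia-smi` on the PATH for GPU stats; the file name `train_health.log` and the 30 s interval are arbitrary choices):

```python
import subprocess
import threading
import time


def read_loadavg():
    """Return the 1-minute load average from /proc/loadavg (Linux only)."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])


def read_gpu_stats():
    """Query GPU temperature, memory, and power via nvidia-smi.

    Returns the raw CSV line, or None if nvidia-smi is unavailable.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=temperature.gpu,memory.used,power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5,
        )
        return out.stdout.strip() if out.returncode == 0 else None
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None


def monitor(logfile, interval, stop):
    """Append one line of system stats every `interval` seconds until `stop` is set."""
    with open(logfile, "a") as f:
        while not stop.wait(interval):
            f.write(f"{time.time():.0f} load={read_loadavg()} gpu={read_gpu_stats()}\n")
            f.flush()  # flush each line so it is on disk if the machine dies


# Start before the training loop; since the thread is independent of training,
# it also covers the dataloader-heavy phases where the GPU is idle.
stop_event = threading.Event()
threading.Thread(
    target=monitor, args=("train_health.log", 30.0, stop_event), daemon=True
).start()
```

After a crash, the tail of the log shows the last known temperatures and load, which helps distinguish "GPU ran hot until the end" from "everything looked normal and the box just died" (the latter points more toward the PSU).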

Kind Regards

I would first check the logs, e.g. via dmesg, to see if the system reported any errors or a reason for the shutdown. It would also be interesting to learn more about this "shut down" behavior: is the system reporting any errors before shutting itself down, or is it just crashing and turning off directly? The latter case is often caused by an underpowered or defective PSU, so you might want to swap it (even temporarily, for debugging) and see if this helps.
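Concretely, on a Linux box with systemd, commands along these lines are a reasonable starting point (exact flags may vary by distro, and reading kernel logs may require root):

```shell
# Kernel errors since the current boot (after the machine came back up,
# this shows hardware issues detected on the fresh boot, not the crash itself)
dmesg --level=err,crit,alert,emerg

# Tail of the journal from the *previous* boot, i.e. the one that crashed
journalctl -b -1 -e

# Only error-and-worse messages from that boot
journalctl -b -1 -p err
```

A sudden power-off often leaves nothing at all in the previous boot's journal, which is itself a hint: thermal shutdowns and kernel panics usually log something first, while a PSU cutting out does not.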
