How to retry a randomly crashing training task in a bash script?

JimW · August 17, 2022, 9:36pm

Hello, I am new to this field. I am launching some PyTorch training tasks on a server, but they crash randomly at various epochs. The crashes might be related to CUDA memory, num_workers > 0, and some setup incompatibilities, which makes them difficult to debug. I was wondering whether it is possible to work around this by retrying the training from a bash script.

My bash script currently looks like this:
apt-get install XXX
cd "my folder/"
python my_train.py --save_ckpth_path my_checkpoint_folder

The my_train.py command saves a checkpoint file after every epoch.
What I hope to do is this: every time the training command crashes, rerun the same command, but have it initialize from the latest saved checkpoint.

Do you think the above idea makes sense? If so, could someone provide some guidance on how to implement it in the script? Thanks.
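
For reference, here is a minimal sketch of the retry idea in bash. It rests on assumptions that are not in the script above: that checkpoints end in `.pth`, and that my_train.py accepts a hypothetical `--resume_from` flag and loads that file itself. `MAX_RETRIES` and `RETRY_DELAY` are also made-up names for illustration. The loop only restarts the process; it does not address whatever causes the crash.

```shell
#!/usr/bin/env bash
# Sketch only: the names below are assumptions, not part of the original script.
CKPT_DIR="${CKPT_DIR:-my_checkpoint_folder}"   # folder passed to --save_ckpth_path
MAX_RETRIES="${MAX_RETRIES:-5}"                # hypothetical cap on attempts
RETRY_DELAY="${RETRY_DELAY:-10}"               # hypothetical pause (seconds) between attempts

# Print the most recently modified checkpoint, or nothing if none exists.
# Assumes checkpoints end in .pth and have no newlines in their names.
latest_checkpoint() {
    ls -t "$CKPT_DIR"/*.pth 2>/dev/null | head -n 1
}

# Run the given command until it exits with status 0 or MAX_RETRIES is reached.
# When a checkpoint exists, append the hypothetical --resume_from flag.
run_with_retry() {
    local attempt=1 status ckpt
    while [ "$attempt" -le "$MAX_RETRIES" ]; do
        ckpt="$(latest_checkpoint)"
        if [ -n "$ckpt" ]; then
            "$@" --resume_from "$ckpt"
        else
            "$@"
        fi
        status=$?                      # exit status of the command that just ran
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "Attempt $attempt/$MAX_RETRIES exited with status $status" >&2
        if [ "$attempt" -lt "$MAX_RETRIES" ]; then
            sleep "$RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage, left commented so the functions can be sourced and tried separately:
# run_with_retry python my_train.py --save_ckpth_path "$CKPT_DIR"
```

Capping the number of attempts keeps a deterministic failure (for example, a bad path) from looping forever. For the resume step to work, each checkpoint would presumably need to hold the optimizer state and epoch counter in addition to the model weights.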