Training on CPU abruptly quits without error or warnings

Something weird is happening: my code runs for over an epoch and then just abruptly stops without giving any errors or warnings. I'm not sure what to look for. For example, you can see below that it suddenly stopped training at the 3rd batch of Epoch 1:

Size of validation set: 6214
/home/ubuntu/deep-behavior-embedding/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /home/ubuntu/jira/4405/model/pure_premium exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")

  | Name           | Type                        | Params
---------------------------------------------------------------
0 | base_model     | LitModel                    | 2.9 M 
1 | linear         | Linear                      | 101   
2 | valid_deviance | TweedieDevianceScore        | 0     
3 | test_deviance  | TweedieDevianceScore        | 0     
4 | valid_mape     | MeanAbsolutePercentageError | 0     
---------------------------------------------------------------
2.9 M     Trainable params
0         Non-trainable params
2.9 M     Total params
11.476    Total estimated model params size (MB)
Epoch 1:   0%|                 | 3/972 [00:16<1:27:38,  5.43s/it, loss=225, v_num=17]
(.venv) ubuntu@ip-172-18-81-185:~/deep-behavior-embedding$ 

Could you rerun the use case with gdb and check whether you see any error?
I would expect to see at least an error message, so I'm unsure if the script is really failing or if it's hitting a check and just exiting (e.g. an intentional exit after 975 batches).

@ptrblck I’ve never used gdb. Would you know of a good tutorial to help me get started with that?

This might be a good starting point.
However, as a quick TL;DR, run this:

gdb --args python script.py args
...
# start the script inside gdb
run
...
# if it crashes here, print the backtrace with:
bt

and post the backtrace here.
If the code never crashes but exits normally, you would need to check in the actual Python script where the execution exits.
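As a starting point inside the script itself, here is a minimal sketch (standard library only, names are illustrative) that distinguishes a clean interpreter exit from the process being killed:

import atexit
import faulthandler
import signal
import sys

# Dump a Python traceback on crashes (segfaults and other fatal signals).
faulthandler.enable()
# Also dump a traceback if something sends SIGTERM to the process.
faulthandler.register(signal.SIGTERM)

# This only runs if the interpreter shuts down normally (end of script, sys.exit()).
atexit.register(lambda: print("interpreter exited normally", file=sys.stderr))

If neither the atexit message nor a traceback shows up, the process was most likely killed from the outside (e.g. via SIGKILL, which cannot be caught), and gdb or the kernel log would be the next place to look.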


Hmm, I tried this and am just getting:

GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)

This is the expected startup message, after which you would need to type run into the gdb session.


I ended up just trying to run my training code on a bigger CPU box on EC2 (a c5.24xlarge as opposed to a t3.2xlarge), and now it runs successfully to completion. I also discovered the DeviceStatsMonitor callback in pytorch-lightning and will see if that might be useful as a diagnostic tool.
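For reference, a minimal sketch of how I'm wiring it into the Trainer (the accelerator setting reflects my CPU-only setup, and model stands in for my LightningModule):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import DeviceStatsMonitor

# Log per-batch device statistics; on a CPU-only run this reports CPU utilization
# and memory usage, provided psutil is installed.
trainer = Trainer(
    accelerator="cpu",
    callbacks=[DeviceStatsMonitor()],
)
# trainer.fit(model)  # model is my LightningModule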

You can see here an example where the CPU% suddenly shoots up, and at the end of that spike the training just quit midway through the first epoch.

I can't speculate on the root cause and think a backtrace might still be the proper way to narrow down what exactly is causing the issue; e.g. it could just be a simple OOM on the host.
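If you want to rule that out quickly, here is a small sketch (assuming a Linux host where dmesg is readable without elevated privileges) that looks for OOM-killer entries after the run dies:

import subprocess

# Scan the kernel log for OOM-killer activity; such entries typically name the
# killed process (e.g. python) and its memory usage at the time.
kernel_log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "oom-killer" in line.lower() or "out of memory" in line.lower():
        print(line)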