Training on CPU abruptly quits without error or warnings

Something weird is happening: my code runs for over an epoch and then just abruptly stops without giving any errors or warnings. I'm not sure what to look for. For example, you can see below that it suddenly stopped training at the 3rd batch of Epoch 1:

Size of validation set: 6214
/home/ubuntu/deep-behavior-embedding/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /home/ubuntu/jira/4405/model/pure_premium exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")

  | Name           | Type                        | Params
---------------------------------------------------------------
0 | base_model     | LitModel                    | 2.9 M 
1 | linear         | Linear                      | 101   
2 | valid_deviance | TweedieDevianceScore        | 0     
3 | test_deviance  | TweedieDevianceScore        | 0     
4 | valid_mape     | MeanAbsolutePercentageError | 0     
---------------------------------------------------------------
2.9 M     Trainable params
0         Non-trainable params
2.9 M     Total params
11.476    Total estimated model params size (MB)
Epoch 1:   0%|                 | 3/972 [00:16<1:27:38,  5.43s/it, loss=225, v_num=17]
(.venv) ubuntu@ip-172-18-81-185:~/deep-behavior-embedding$ 

Could you rerun the use case with gdb and check whether you see any error?
I would expect to see at least an error message, so I'm unsure if the script is really failing or if it's hitting a check and just exiting (e.g. an intentional exit after 975 batches).

@ptrblck I’ve never used gdb. Would you know of a good tutorial to help me get started with that?

This might be a good starting point.
However, as a quick TL;DR, run this:

gdb --args python script.py args
...
# start the script inside gdb
run
...
# if it crashes here, print the backtrace with:
bt

and post the backtrace here.
If the code never crashes but exits normally, you would need to check in the actual Python script where the execution exits.
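As a starting point inside the script itself, here is a minimal sketch (standard library only, names are illustrative) that distinguishes a clean interpreter exit from the process being killed:

import atexit
import faulthandler
import signal
import sys

# Dump a Python traceback on crashes (segfaults and other fatal signals).
faulthandler.enable()
# Also dump a traceback if something sends SIGTERM to the process.
faulthandler.register(signal.SIGTERM)

# This only runs if the interpreter shuts down normally (end of script, sys.exit()).
atexit.register(lambda: print("interpreter exited normally", file=sys.stderr))

If neither the atexit message nor a traceback shows up, the process was most likely killed from the outside (e.g. via SIGKILL, which cannot be caught), and gdb or the kernel log would be the next place to look.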


Hmm, I tried this and am just getting:

GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)

This is the expected startup message, after which you would need to type run into the gdb session.


I ended up just trying to run my training code on a bigger CPU box on EC2 (a c5.24xlarge as opposed to a t3.2xlarge), and now it runs successfully to completion. I also discovered the DeviceStatsMonitor callback in pytorch-lightning and will see if that might be useful as a diagnostic tool.
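For reference, a minimal sketch of how I'm wiring it into the Trainer (the accelerator setting reflects my CPU-only setup, and model stands in for my LightningModule):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import DeviceStatsMonitor

# Log per-batch device statistics; on a CPU-only run this reports CPU utilization
# and memory usage, provided psutil is installed.
trainer = Trainer(
    accelerator="cpu",
    callbacks=[DeviceStatsMonitor()],
)
# trainer.fit(model)  # model is my LightningModule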

You can see here an example where the CPU% suddenly shoots up, and at the end of that spike the training just quit midway through the first epoch.

I can't speculate on the root cause and think a backtrace might still be the proper way to narrow down what exactly is causing the issue; e.g. it could just be a simple OOM on the host.
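If you want to rule that out quickly, here is a small sketch (assuming a Linux host where dmesg is readable without elevated privileges) that looks for OOM-killer entries after the run dies:

import subprocess

# Scan the kernel log for OOM-killer activity; such entries typically name the
# killed process (e.g. python) and its memory usage at the time.
kernel_log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "oom-killer" in line.lower() or "out of memory" in line.lower():
        print(line)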