Something weird is happening: my code runs for over an epoch, then abruptly stops without giving any errors or warnings. I’m not sure what to look for. For example, below you can see that it suddenly stopped training at the 3rd batch of Epoch 1:
Size of validation set: 6214
/home/ubuntu/deep-behavior-embedding/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /home/ubuntu/jira/4405/model/pure_premium exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
| Name | Type | Params
---------------------------------------------------------------
0 | base_model | LitModel | 2.9 M
1 | linear | Linear | 101
2 | valid_deviance | TweedieDevianceScore | 0
3 | test_deviance | TweedieDevianceScore | 0
4 | valid_mape | MeanAbsolutePercentageError | 0
---------------------------------------------------------------
2.9 M Trainable params
0 Non-trainable params
2.9 M Total params
11.476 Total estimated model params size (MB)
Epoch 1: 0%| | 3/972 [00:16<1:27:38, 5.43s/it, loss=225, v_num=17]
(.venv) ubuntu@ip-172-18-81-185:~/deep-behavior-embedding$
Could you rerun the use case with gdb and check whether you see any error?
I would expect to see at least an error message, so I’m unsure whether the script is really failing or whether it’s hitting a check and just exiting (e.g. exit after 975 batches).
This might be a good starter.
However, as a quick TL;DR run this:
gdb --args python script.py args
...
run
...
# if it crashes here
bt
and post the backtrace here.
If the code never crashes but exits normally, you would need to check in the actual Python script where the execution exits.
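For the "exits normally" case, a minimal sketch of one way to find where the script leaves is shown below. It assumes the exit goes through `sys.exit()` (which raises `SystemExit`); the helper name `run_and_report_exit` and the stand-in `fake_training_loop` are hypothetical and not part of the original script. It would not catch `os._exit()` or a process killed by a signal such as SIGKILL.

```python
import sys
import traceback


def run_and_report_exit(main):
    """Call main() and describe how it finished (hypothetical helper)."""
    try:
        main()
    except SystemExit as exc:
        # Any sys.exit() reached from main() lands here; the printed
        # traceback shows the file and line that requested the exit.
        traceback.print_exc(file=sys.stderr)
        return f"SystemExit(code={exc.code!r})"
    except KeyboardInterrupt:
        return "KeyboardInterrupt"
    return "returned normally"


def fake_training_loop():
    # Stand-in for the real training entry point (an assumption for this
    # sketch): it exits silently at batch index 3 without raising an error.
    for batch_idx in range(972):
        if batch_idx == 3:
            sys.exit(0)


if __name__ == "__main__":
    print(run_and_report_exit(fake_training_loop))
```

Separately, running the script as `python -X faulthandler script.py args` makes the interpreter dump the Python traceback on fatal signals such as SIGSEGV, which can complement the gdb backtrace.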
GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
I ended up running my training code on a bigger CPU box on EC2 (a c5.24xlarge instead of a t3.2xlarge), and now it runs successfully to completion. I also discovered the DeviceStatsMonitor callback in pytorch-lightning and will see whether it is useful as a diagnostic tool.
I can’t speculate on the root cause, and I think a backtrace is still the proper way to narrow down what exactly is causing the issue; it could, for example, be a simple out-of-memory (OOM) condition on the host.
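To check the host OOM hypothesis, one rough option is to log the peak resident memory of the training process every few batches and see whether it climbs toward the instance's RAM. The sketch below uses only the standard-library `resource` module (Unix only); the helper names `peak_rss_mib` and `log_memory` are hypothetical. It measures the main process only, so memory used by separate DataLoader worker processes would not show up here.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(batch_idx, every=50):
    """Print the peak RSS every `every` batches; return the logged line or None."""
    if batch_idx % every != 0:
        return None
    line = f"batch {batch_idx}: peak RSS {peak_rss_mib():.1f} MiB"
    print(line, file=sys.stderr, flush=True)
    return line
```

Calling `log_memory(batch_idx)` from the training step (for example, inside `training_step`) would leave a trail in stderr up to the point where the process disappears. If the Linux OOM killer is responsible, the kernel log (e.g. via `dmesg`) normally records the kill as well.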