Training Stops with exit code 0

tsuijenk · January 19, 2022, 9:20pm

Hello everyone,

My training suddenly stops after a certain number of epochs: the training steps complete, but the run ends before the validation steps begin.

I am using PyTorch Lightning on a single GPU. No error was reported in the interface. The only clue I found was in the Weights & Biases log, which reports exit code 0 (which is a normal shutdown?). Before this exit code, I also saw a multiprocessing spawn message. Is PyTorch using multiprocessing even though I didn’t call it? Is something going wrong with the multiprocessing that’s being used?

Thank you for any advice!

2022-01-19 05:46:06,274 INFO    MainThread:10668 [wandb_setup.py:_flush():71] setting env: {}
2022-01-19 05:46:06,274 INFO    MainThread:10668 [wandb_setup.py:_flush():71] setting login settings: {}
2022-01-19 05:46:06,275 INFO    MainThread:10668 [wandb_init.py:_log_setup():371] Logging user logs to C:\robofish\src\robofish\wandb\run-20220119_054606-1r2opu6p\logs\debug.log
2022-01-19 05:46:06,276 INFO    MainThread:10668 [wandb_init.py:_log_setup():372] Logging internal logs to C:\robofish\src\robofish\wandb\run-20220119_054606-1r2opu6p\logs\debug-internal.log
2022-01-19 05:46:06,277 INFO    MainThread:10668 [wandb_init.py:init():404] calling init triggers
2022-01-19 05:46:06,277 INFO    MainThread:10668 [wandb_init.py:init():411] wandb.init called with sweep_config: {}
config: {}
2022-01-19 05:46:06,279 INFO    MainThread:10668 [wandb_init.py:init():460] starting backend
2022-01-19 05:46:06,279 INFO    MainThread:10668 [backend.py:_multiprocessing_setup():101] multiprocessing start_methods=spawn, using: spawn
2022-01-19 05:46:06,292 INFO    MainThread:10668 [backend.py:ensure_launched():216] starting backend process...
2022-01-19 05:46:06,459 INFO    MainThread:10668 [backend.py:ensure_launched():222] started backend process with pid: 7480
2022-01-19 05:46:06,460 INFO    MainThread:10668 [wandb_init.py:init():469] backend started and connected
2022-01-19 05:46:06,465 INFO    MainThread:10668 [wandb_init.py:init():533] updated telemetry
2022-01-19 05:46:06,541 INFO    MainThread:10668 [wandb_init.py:init():563] communicating current version
2022-01-19 05:46:08,603 INFO    MainThread:10668 [wandb_init.py:init():568] got version response 
2022-01-19 05:46:08,604 INFO    MainThread:10668 [wandb_init.py:init():578] communicating run to backend with 30 second timeout
2022-01-19 05:46:08,975 INFO    MainThread:10668 [wandb_init.py:init():606] starting run threads in backend
2022-01-19 05:46:12,462 INFO    MainThread:10668 [wandb_run.py:_console_start():1810] atexit reg
2022-01-19 05:46:12,464 INFO    MainThread:10668 [wandb_run.py:_redirect():1684] redirect: SettingsConsole.WRAP
2022-01-19 05:46:12,464 INFO    MainThread:10668 [wandb_run.py:_redirect():1721] Wrapping output streams.
2022-01-19 05:46:12,468 INFO    MainThread:10668 [wandb_run.py:_redirect():1745] Redirects installed.
2022-01-19 05:46:12,468 INFO    MainThread:10668 [wandb_init.py:init():633] run started, returning control to user process
2022-01-19 05:46:12,469 INFO    MainThread:10668 [wandb_run.py:_config_callback():956] config_cb None None {}
2022-01-19 06:25:44,644 INFO    MainThread:10668 [wandb_run.py:_atexit_cleanup():1780] got exitcode: 0
2022-01-19 06:25:44,646 INFO    MainThread:10668 [wandb_run.py:_restore():1752] restore
2022-01-19 06:25:44,971 INFO    MainThread:10668 [wandb_run.py:_wait_for_finish():1912] got exit ret: file_counts {

scottire · January 22, 2022, 10:54am

This could be a Lightning issue, so you may want to consult their forum. To help others help you, consider building a small reproducible example and sharing it in your post.
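
On the exit code itself: the `multiprocessing start_methods=spawn` line in the log comes from wandb's own `backend.py`, right before it reports starting its backend process, so it looks like wandb launching its helper process rather than something going wrong in PyTorch. As for exit code 0, here is a rough sketch (standard library only, and `exit_code_of` is just a name made up for this illustration) showing what a Python process returns in a few cases. It is not specific to Lightning or wandb.

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run a Python snippet in a fresh interpreter and return its exit code.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,  # keep the child's traceback out of our console
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    # Script runs to the end without an uncaught exception.
    print(exit_code_of("pass"))                        # 0
    # Uncaught exception: the interpreter exits with a non-zero code.
    print(exit_code_of("raise RuntimeError('boom')"))  # 1
    # Explicit exit with a chosen code.
    print(exit_code_of("import sys; sys.exit(3)"))     # 3
```

So an exit code of 0 suggests the script returned normally (or called `sys.exit(0)`) rather than crashing. It may be worth checking whether something ended training on purpose, for example a `max_epochs` limit or an early-stopping callback, if you have one configured.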