After one epoch of successful training, I always get this error message. What could be the reason?

2023-05-01 15:16:45 | INFO | yolox.core.trainer:259 - epoch: 2/300, iter: 250/313, mem: 11174Mb, iter_time: 3.676s, data_time: 3.100s, total_loss: 8.0, iou_loss: 3.3, l1_loss: 0.0, conf_loss: 3.4, cls_loss: 0.9, seg_loss: 0.3, lr: 1.294e-03, size: 384, ETA: 4 days, 0:29:34

Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/workspace/tools/train.py", line 191, in <module>
launch(
File "/workspace/yolox/core/launch.py", line 82, in launch
mp.start_processes(
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
root@e7968c684346:/workspace# /opt/conda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I would monitor whether memory usage (on the host side) grows during training and approaches the system limit, as running out of host memory can trigger the SIGKILL.
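Here is a minimal sketch of how host memory could be logged during the training loop, assuming psutil is installed in the environment (the 50-iteration interval is arbitrary):

```python
import psutil

proc = psutil.Process()  # the current training process

def log_host_memory(step):
    """Log this process's resident set size and overall system memory usage."""
    rss_gb = proc.memory_info().rss / 1024**3
    sys_used_pct = psutil.virtual_memory().percent
    print(f"step {step}: host RSS {rss_gb:.2f} GB, system memory {sys_used_pct:.1f}% used")

# inside the training loop, e.g.:
# if step % 50 == 0:
#     log_host_memory(step)
```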

If you are seeing memory usage growing, I would check, e.g., that you are detaching losses before computing statistics and are not otherwise retaining a computation graph unnecessarily.
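A minimal, self-contained sketch of the pattern (toy model and loop, not the YOLOX code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # toy stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Wrong: accumulating the tensor keeps every iteration's autograd graph alive,
    # so memory grows over time:
    # running_loss += loss

    # Right: .item() (or loss.detach()) yields a plain float, letting the graph be freed
    running_loss += loss.item()

print(f"mean loss over 100 steps: {running_loss / 100:.4f}")
```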

Thanks @eqy. After further investigating and verifying the dataset, I found that I only get the error when the number of workers in the DataLoader is greater than 0; workers=0 works fine. I am using cv2 to resize images inside the dataloader's __getitem__ method. I also tried restricting cv2.setNumThreads(0) and mp.set_start_method('spawn', force=True), but neither helped.
I get the following error when I increase workers > 0.

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated

/opt/conda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 96 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
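For context, my setup looks roughly like this (a simplified sketch, not the actual YOLOX dataset; the class name and image paths are made up):

```python
import cv2
import torch
from torch.utils.data import Dataset, DataLoader

cv2.setNumThreads(0)  # disable OpenCV's own thread pool in the main process

class ResizeDataset(Dataset):
    """Simplified stand-in for the real dataset: loads an image and resizes it with cv2."""
    def __init__(self, image_paths, size=384):
        self.image_paths = image_paths
        self.size = size

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx])
        img = cv2.resize(img, (self.size, self.size))
        return torch.from_numpy(img).permute(2, 0, 1).float()

def worker_init_fn(worker_id):
    # also disable OpenCV threading inside each DataLoader worker process
    cv2.setNumThreads(0)

if __name__ == "__main__":
    torch.multiprocessing.set_start_method("spawn", force=True)
    dataset = ResizeDataset(["img0.jpg", "img1.jpg"])  # hypothetical paths
    loader = DataLoader(dataset, batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn)
    for batch in loader:
        print(batch.shape)
```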

Just to be sure, I would also set the thread-count environment variables for other backends, such as OMP_NUM_THREADS and MKL_NUM_THREADS, and check whether that makes a difference.
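For example (a sketch; the value 1 is just a starting point), they can be set in Python before the heavy libraries are imported:

```python
import os

# Limit per-process threading in OpenMP/MKL-backed libraries.
# These must be set before numpy/torch/cv2 are imported to take effect reliably.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch  # imported after the environment variables are set
print(torch.get_num_threads())
```

Equivalently, the training command can be prefixed on the shell with OMP_NUM_THREADS=1 MKL_NUM_THREADS=1.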