I have run into a strange problem that happens repeatedly.
I start a training job on a Linux server with the following command:
oarsub -l "host=1/gpuid=4,walltime=480:0:0" \
  "/home/username/.env/py37/bin/python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /data/coco --output_dir /home/username/code/output --resume /home/username/code/output/checkpoint.pth"
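For context, because the command passes --use_env, torch.distributed.launch exports the rank information (LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) as environment variables instead of passing a --local_rank argument, and main.py is expected to read them. A minimal sketch of that setup (my simplified reconstruction, not the actual main.py):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # With --use_env, the launcher sets LOCAL_RANK (and RANK, WORLD_SIZE,
    # MASTER_ADDR, MASTER_PORT) in the environment of each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # init_method="env://" reads the MASTER_* variables set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank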
After a few hours, the training process was killed, and the same thing happened every time I restarted it. Our system administrator could not figure out what was wrong.
The standard error messages (the content of OAR.<jobID>.stderr) are the following:
Traceback (most recent call last):
  File "/home/username/.local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/username/.local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/username/.env/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/username/.env/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/username/.env/py37/bin/python', '-u', 'main.py', '--coco_path', '/data/coco', '--output_dir', '/home/username/code/output', '--resume', '/home/username/code/output/checkpoint.pth']' died with <Signals.SIGKILL: 9>.
In the standard output file OAR.<jobID>.stdout, the last lines are the following:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
This message appears only at the very end of OAR.<jobID>.stdout when the crash happens, so it may have something to do with the crash.
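Since the child process died with SIGKILL, I suspect (but have not confirmed) that something external, such as the kernel OOM killer or a resource limit enforced by the cluster, is killing it. For reference, this is the kind of per-rank memory logging I could add to the training loop in main.py to narrow it down; log_memory is a hypothetical helper of my own, not part of the existing code:

import os
import torch

def log_memory(step):
    # Resident set size of this process, read from /proc (Linux only).
    with open("/proc/self/status") as f:
        vmrss = next(line.strip() for line in f if line.startswith("VmRSS"))
    # GPU memory currently allocated by tensors on this rank's device.
    gpu_mib = torch.cuda.memory_allocated() / 2**20
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] step {step}: {vmrss}, "
          f"gpu_allocated={gpu_mib:.0f} MiB", flush=True)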
Could you please help? Thank you very much in advance!