subprocess.CalledProcessError: Command '['/share/software/user/open/python/3.9.0/bin/python3', '-u', 'main_pretrain.py', '--local_rank=3']' returned non-zero exit status 1

cyril · June 30, 2022, 10:26am

I’m trying to train my network on my university cluster, on 4 A100 GPUs with 80 GB of RAM, but it crashes mid-epoch with the following uninformative error:

[03:00:26.169445] Start training for 800 epochs
[03:00:29.881443] Epoch: [0]  [  0/157]  eta: 0:09:42  lr: 0.000000  loss: 2.5381 (2.5381)  time: 3.7102  data: 2.6209  max mem: 7930
[03:01:20.581232] Epoch: [0]  [ 20/157]  eta: 0:05:54  lr: 0.000000  loss: 2.2845 (2.3390)  time: 2.5349  data: 0.4167  max mem: 9074
[03:02:11.337795] Epoch: [0]  [ 40/157]  eta: 0:05:00  lr: 0.000000  loss: 2.3415 (2.3727)  time: 2.5378  data: 0.8262  max mem: 9074
[03:02:59.493182] Epoch: [0]  [ 60/157]  eta: 0:04:03  lr: 0.000001  loss: 2.2510 (2.3554)  time: 2.4077  data: 0.0760  max mem: 9074
[03:03:48.747358] Epoch: [0]  [ 80/157]  eta: 0:03:12  lr: 0.000001  loss: 2.3077 (2.3214)  time: 2.4627  data: 0.7480  max mem: 9074
[03:04:40.015038] Epoch: [0]  [100/157]  eta: 0:02:23  lr: 0.000001  loss: 1.8802 (2.2680)  time: 2.5633  data: 0.1273  max mem: 9074
[03:05:27.755995] Epoch: [0]  [120/157]  eta: 0:01:32  lr: 0.000001  loss: 1.9293 (2.2229)  time: 2.3870  data: 0.0006  max mem: 9074
Killing subprocess 57832
Killing subprocess 57833
Killing subprocess 57834
Killing subprocess 57835
Traceback (most recent call last):
  File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/software/user/open/python/3.9.0/bin/python3', '-u', 'main_pretrain.py', '--local_rank=3']' returned non-zero exit status 1.

I use the following to train:

LOGLEVEL=INFO python3 -u -m torch.distributed.launch --nproc_per_node=4 main_pretrain.py

What’s weird is that the exact same code runs fine on my local setup of 4 V100 GPUs with 40 GB of RAM.

Both setups are running PyTorch 1.8.1 (the version shown in the traceback paths above) with CUDA 11. Any advice is greatly appreciated!

cyril · July 1, 2022, 12:18pm

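For future reference, importing record with
from torch.distributed.elastic.multiprocessing.errors import record
and applying the @record decorator to the script's entry-point function should help with debugging: the launcher then reports the worker's actual exception instead of only the exit status. In my case, a missing image file was causing the crash!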
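A minimal sketch of how this might look in a training script. The function names `find_missing_files` and `main`, and the idea of passing a list of image paths, are hypothetical and not taken from main_pretrain.py. The no-op fallback for `record` is my own assumption, for PyTorch builds that do not ship `torch.distributed.elastic`.

```python
import os

try:
    # Decorator that records the worker's exception so the launcher can report it.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: on builds without this module, fall back to a no-op
    # decorator so the script still runs (without the extra error reporting).
    def record(fn):
        return fn


def find_missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


@record
def main(image_paths):
    # Hypothetical pre-flight check: fail early, with a readable message,
    # instead of crashing mid-epoch inside a data loader worker.
    missing = find_missing_files(image_paths)
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} missing image file(s), first one: {missing[0]}"
        )
    # ... the actual training loop would go here ...
    return len(image_paths)
```

Calling `main(paths)` from the script's `if __name__ == "__main__":` block keeps the decorator on the outermost function, so any exception raised during training passes through it before the process exits.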