subprocess.CalledProcessError: Command '['/share/software/user/open/python/3.9.0/bin/python3', '-u', 'main_pretrain.py', '--local_rank=3']' returned non-zero exit status 1

I’m trying to train my network on my university cluster on 4 A100 GPUs with 80 GB of RAM, but it crashes mid-epoch with the following uninformative error:

[03:00:26.169445] Start training for 800 epochs
[03:00:29.881443] Epoch: [0]  [  0/157]  eta: 0:09:42  lr: 0.000000  loss: 2.5381 (2.5381)  time: 3.7102  data: 2.6209  max mem: 7930
[03:01:20.581232] Epoch: [0]  [ 20/157]  eta: 0:05:54  lr: 0.000000  loss: 2.2845 (2.3390)  time: 2.5349  data: 0.4167  max mem: 9074
[03:02:11.337795] Epoch: [0]  [ 40/157]  eta: 0:05:00  lr: 0.000000  loss: 2.3415 (2.3727)  time: 2.5378  data: 0.8262  max mem: 9074
[03:02:59.493182] Epoch: [0]  [ 60/157]  eta: 0:04:03  lr: 0.000001  loss: 2.2510 (2.3554)  time: 2.4077  data: 0.0760  max mem: 9074
[03:03:48.747358] Epoch: [0]  [ 80/157]  eta: 0:03:12  lr: 0.000001  loss: 2.3077 (2.3214)  time: 2.4627  data: 0.7480  max mem: 9074
[03:04:40.015038] Epoch: [0]  [100/157]  eta: 0:02:23  lr: 0.000001  loss: 1.8802 (2.2680)  time: 2.5633  data: 0.1273  max mem: 9074
[03:05:27.755995] Epoch: [0]  [120/157]  eta: 0:01:32  lr: 0.000001  loss: 1.9293 (2.2229)  time: 2.3870  data: 0.0006  max mem: 9074
Killing subprocess 57832
Killing subprocess 57833
Killing subprocess 57834
Killing subprocess 57835
Traceback (most recent call last):
  File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/software/user/open/python/3.9.0/bin/python3', '-u', 'main_pretrain.py', '--local_rank=3']' returned non-zero exit status 1.

I use the following to train:

LOGLEVEL=INFO python3 -u -m torch.distributed.launch --nproc_per_node=4 main_pretrain.py

What’s weird is that the exact same code runs fine on my local setup of 4 V100 GPUs with 40 GB of RAM.

Both setups are running PyTorch 1.8.1 with CUDA 11. Any advice is greatly appreciated!

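For future reference: importing the decorator with
from torch.distributed.elastic.multiprocessing.errors import record
and applying @record to the main function of the training script should help with debugging, since it lets the failing worker's own exception be reported instead of just the launcher's CalledProcessError. In my case, a missing image file was causing the crash!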
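
To make that concrete, here is a minimal sketch of how the decorator could be applied, together with a pre-flight check for the kind of missing file that caused the crash. The names find_missing_files, main and image_paths are hypothetical placeholders and not taken from main_pretrain.py. The no-op fallback is also an assumption, added so the script still starts on installs where torch.distributed.elastic cannot be imported.

```python
import os

try:
    # Real PyTorch decorator: records the worker's exception for the launcher.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: fall back to a no-op decorator when the module is unavailable.
    def record(fn):
        return fn


def find_missing_files(paths):
    """Return the paths that are not existing regular files, in input order."""
    return [p for p in paths if not os.path.isfile(p)]


@record
def main(image_paths=()):
    # Hypothetical entry point: fail early, with a readable message,
    # before any worker gets partway through an epoch.
    missing = find_missing_files(image_paths)
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} missing file(s), e.g. {missing[0]}"
        )
    # ... the actual training loop would go here ...
    return 0


if __name__ == "__main__":
    main()
```

Checking the dataset paths once before training starts is cheap compared with losing a run partway through epoch 0, and it gives an error message that names the offending file.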