I’m trying to train my network on my university cluster on 4 A100 GPUs (80 GB of GPU memory each), but it keeps crashing mid-epoch with the following uninformative error:
[03:00:26.169445] Start training for 800 epochs
[03:00:29.881443] Epoch: [0] [ 0/157] eta: 0:09:42 lr: 0.000000 loss: 2.5381 (2.5381) time: 3.7102 data: 2.6209 max mem: 7930
[03:01:20.581232] Epoch: [0] [ 20/157] eta: 0:05:54 lr: 0.000000 loss: 2.2845 (2.3390) time: 2.5349 data: 0.4167 max mem: 9074
[03:02:11.337795] Epoch: [0] [ 40/157] eta: 0:05:00 lr: 0.000000 loss: 2.3415 (2.3727) time: 2.5378 data: 0.8262 max mem: 9074
[03:02:59.493182] Epoch: [0] [ 60/157] eta: 0:04:03 lr: 0.000001 loss: 2.2510 (2.3554) time: 2.4077 data: 0.0760 max mem: 9074
[03:03:48.747358] Epoch: [0] [ 80/157] eta: 0:03:12 lr: 0.000001 loss: 2.3077 (2.3214) time: 2.4627 data: 0.7480 max mem: 9074
[03:04:40.015038] Epoch: [0] [100/157] eta: 0:02:23 lr: 0.000001 loss: 1.8802 (2.2680) time: 2.5633 data: 0.1273 max mem: 9074
[03:05:27.755995] Epoch: [0] [120/157] eta: 0:01:32 lr: 0.000001 loss: 1.9293 (2.2229) time: 2.3870 data: 0.0006 max mem: 9074
Killing subprocess 57832
Killing subprocess 57833
Killing subprocess 57834
Killing subprocess 57835
Traceback (most recent call last):
File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/share/software/user/open/python/3.9.0/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/software/user/open/python/3.9.0/bin/python3', '-u', 'main_pretrain.py', '--local_rank=3']' returned non-zero exit status 1.
I use the following to train:
LOGLEVEL=INFO python3 -u -m torch.distributed.launch --nproc_per_node=4 main_pretrain.py
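To try to surface the real per-rank error, I also experimented with wrapping the entry point so each worker prints its own traceback before the launcher reaps it. This is just a sketch: the wrapper name and the `main_fn` placeholder are mine, not from `main_pretrain.py`.

```python
import os
import sys
import traceback

def run_with_rank_traceback(main_fn, argv=None):
    """Run a training entry point and, if it raises, print the full
    traceback tagged with the local rank before re-raising, so the real
    error survives even after torch.distributed.launch kills the workers.

    Hypothetical helper: main_fn stands in for whatever main_pretrain.py
    actually runs. On PyTorch 1.8.x the launcher passes --local_rank as a
    CLI argument (LOCAL_RANK is only set with --use_env), so check both.
    """
    argv = sys.argv[1:] if argv is None else argv
    rank = os.environ.get("LOCAL_RANK")
    if rank is None:
        for arg in argv:
            if arg.startswith("--local_rank="):
                rank = arg.split("=", 1)[1]
    try:
        return main_fn()
    except Exception:
        print(f"[rank {rank}] uncaught exception:", file=sys.stderr)
        traceback.print_exc()
        raise
```

With something like this, the failing rank’s traceback should appear in the log right before the "Killing subprocess" lines, instead of only the generic exit status 1 from the launcher.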
What’s weird is that the exact same code runs fine on my local setup of 4 V100 GPUs (40 GB of GPU memory each).
Both setups are running PyTorch 1.8.1 with CUDA 11. Any advice is greatly appreciated!