Torch.distributed.elastic is not stable

Not sure if this is a known issue. After I upgraded torch from 1.8 to 1.11, the launcher switched to torch.distributed.elastic and warns that torch.distributed.launch is deprecated. However, my training runs now frequently crash with the error below. I tried on different machines: the error happens often with torch==1.11 (torch.distributed.elastic) but never with torch==1.8 (torch.distributed.launch). Any idea how to solve this? It seems I have to keep my torch version at 1.8.

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34837 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34838 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34839 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34840 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34841 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34842 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34843 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34844 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/kai/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kai/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 34822 got signal: 1
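For what it's worth, the "got signal: 1" in the exception is a POSIX signal number, and number 1 is SIGHUP, which matches the "closing signal SIGHUP" warnings above it. A quick way to check the mapping yourself (on a POSIX system):

```python
import signal

# "got signal: 1" refers to POSIX signal number 1. Mapping the number
# back to its name shows it is SIGHUP, which is typically delivered
# when the controlling terminal closes (e.g. an SSH session drops).
# The elastic agent installs a handler that re-raises it as a
# SignalException and shuts the workers down.
sig = signal.Signals(1)
print(sig.name)  # -> SIGHUP
```

If the job is launched from an SSH session, running it under nohup, tmux, or screen might keep the terminal hangup from reaching the agent, though that is only a guess without knowing how the job is launched.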

It is pretty hard to diagnose your issue with just this stack trace. Do you mind opening a GitHub Issue and describing your problem in a bit more detail there?

Hi @cbalioglu, I have opened the issue at Torch.distributed.elastic is not stable · Issue #76894 · pytorch/pytorch · GitHub. Thanks!
