Torch.distributed.elastic is not stable

Not sure if this is a known issue. After I upgraded torch from 1.8 to 1.11, the launcher switched to torch.distributed.elastic and warns that torch.distributed.launch is deprecated. However, my training runs now frequently crash with the error below. I tried on different machines: the error happens often with torch==1.11 (torch.distributed.elastic) but never with torch==1.8 (torch.distributed.launch). Any idea how to solve this? It seems I have to keep my torch version at 1.8.

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34837 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34838 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34839 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34840 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34841 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34842 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34843 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 34844 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/kai/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kai/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/kai/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 34822 got signal: 1
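For what it's worth, the "got signal: 1" in the exception is a POSIX signal number, and number 1 is SIGHUP, which matches the "closing signal SIGHUP" warnings above it. A quick way to check the mapping yourself (on a POSIX system):

```python
import signal

# "got signal: 1" refers to POSIX signal number 1. Mapping the number
# back to its name shows it is SIGHUP, which is typically delivered
# when the controlling terminal closes (e.g. an SSH session drops).
# The elastic agent installs a handler that re-raises it as a
# SignalException and shuts the workers down.
sig = signal.Signals(1)
print(sig.name)  # -> SIGHUP
```

If the job is launched from an SSH session, running it under nohup, tmux, or screen might keep the terminal hangup from reaching the agent, though that is only a guess without knowing how the job is launched.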

It is pretty hard to diagnose your issue with just this stack trace. Do you mind opening a GitHub Issue and describing your problem in a bit more detail there?

Hi @cbalioglu, I have opened the issue at Torch.distributed.elastic is not stable · Issue #76894 · pytorch/pytorch · GitHub. Thanks!
