Hi,
I want to run multiple separate training jobs with torchrun on the same node, like:
torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py --config my_config1
torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py --config my_config2
It seems that when one training job finishes, the other training jobs throw an error like the one below and stop running:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 27378 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xyz' has failed to send a keep-alive heartbeat to the rendezvous 'abc' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xyz' has failed to shutdown the rendezvous 'abc' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "some_dir/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.0+cu118', 'console_scripts', 'torchrun')())
File "some_dir/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "some_dir/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "some_dir/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "some_dir/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "some_dir/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "some_dir/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "some_dir/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "some_dir/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 895, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
Is there a way to avoid this? Thanks!
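One idea I had, assuming the problem is that --standalone pins every job to the same default rendezvous endpoint (localhost:29400 in torch 2.0), so the jobs end up sharing one C10d store that dies when the first job exits: give each job its own endpoint explicitly instead of using --standalone. A sketch of what I mean (the port numbers 29500/29501 are arbitrary free ports I picked, not anything special):

```shell
# Sketch: each job gets its own c10d rendezvous endpoint so they don't
# share a store. Run each in its own shell (or background them with &).
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 \
    --nnodes=1 --nproc_per_node=1 train.py --config my_config1

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29501 \
    --nnodes=1 --nproc_per_node=1 train.py --config my_config2
```

Would that be the right way to isolate the jobs, or is there a cleaner option?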