Multiple training jobs using torchrun on the same node

Hi,

I want to run multiple separate training jobs with torchrun on the same node, like:

torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py --config my_config1
torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py --config my_config2

It seems that when one training job finishes, the other training jobs throw an error like the one below and stop running:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 27378 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xyz' has failed to send a keep-alive heartbeat to the rendezvous 'abc' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xyz' has failed to shutdown the rendezvous 'abc' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "some_dir/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0+cu118', 'console_scripts', 'torchrun')())
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "some_dir/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "some_dir/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "some_dir/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "some_dir/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 895, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
    self._state_holder.sync()
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "some_dir/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

Is there a way to avoid this? Thanks!

I figured out that it can be solved by following the instructions here.
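For anyone hitting the same thing, my understanding (an assumption on my part, not confirmed in the post above) is that `--standalone` jobs all default to the same c10d rendezvous port, so the second job attaches to the first job's TCPStore and fails with `RendezvousConnectionError` when that store goes away. A sketch of a workaround is to drop `--standalone` and give each job its own rendezvous endpoint (port numbers here are arbitrary examples):

```shell
# Each job gets its own c10d store on a distinct port, so the jobs
# no longer share rendezvous state and can finish independently.
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 \
    --nnodes=1 --nproc_per_node=1 train.py --config my_config1 &

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29501 \
    --nnodes=1 --nproc_per_node=1 train.py --config my_config2 &

wait
```

Any two unused ports should work; the point is just that the jobs must not resolve to the same endpoint.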