What closes Rendezvous in torch elastic?

Hello,

In some weird cases (with scaling up and down), I get the following error:
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "elastic-imagenet-wxmlb", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "elastic-imagenet-wxmlb-worker-7", "state": "FAILED", "total_run_time": 0, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n result = agent.run()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n result = self._invoke_run(role)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 152, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 284, in rendezvous_barrier\n return self.init_phase()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 346, in init_phase\n raise RendezvousClosedError()\ntorch.distributed.elastic.rendezvous.api.RendezvousClosedError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}
Neither the logs nor the error message make it clear what closed the Rendezvous backend. This error forces the whole task to fail, and without understanding what is going on I cannot fix the issue.
Any help?

Thank you very much

How are you scaling up and scaling down? RendezvousClosedError is raised when the whole gang is no longer accepting rendezvous (for example, when the job has finished). Perhaps one of your nodes left the group and is now trying to rejoin after the job finished? It would be useful to have an example that reproduces what you are seeing.
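For context, here is a minimal sketch of how that error surfaces (illustrative only, not your exact setup): once some participant marks the rendezvous closed via the handler's set_closed(), any node that later calls next_rendezvous() gets RendezvousClosedError instead of being admitted to a new round. The handler argument below is assumed to be whatever rendezvous handler your launcher constructed; the tuple return matches the call shown in your traceback.

# sketch: where RendezvousClosedError comes from, assuming `handler` is an
# already-constructed torch.distributed.elastic RendezvousHandler
from torch.distributed.elastic.rendezvous.api import (
    RendezvousClosedError,
    RendezvousHandler,
)

def try_join(handler: RendezvousHandler):
    try:
        # blocks until a new rendezvous round completes
        store, rank, world_size = handler.next_rendezvous()
        return store, rank, world_size
    except RendezvousClosedError:
        # the rendezvous was closed (e.g. the job finished and a participant
        # called handler.set_closed()), so late joiners are rejected rather
        # than admitted to a new round
        return None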

cc @cbalioglu

Actually I realized now that this happens even without scaling. The output of the other node is:
INFO 2021-08-03 11:20:51,599 Rendezvous timeout occured in EtcdRendezvousHandler {"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "elastic-imagenet-dzq2v", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "elastic-imagenet-dzq2v-worker-0", "state": "FAILED", "total_run_time": 901, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n result = agent.run()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n result = self._invoke_run(role)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 152, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 284, in rendezvous_barrier\n return self.init_phase()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n return self.join_phase(state[\"version\"])\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 390, in join_phase\n active_version = self.wait_for_peers(expected_version)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 563, in wait_for_peers\n active_version, state = self.try_wait_for_state_change(\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 859, in try_wait_for_state_change\n raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Could it be that this node closed the Rendezvous?

Can you try using the new rendezvous backend (Rendezvous — PyTorch master documentation) instead of the etcd rendezvous and see if you still run into the same error?

Would you please explain how I can do this?
It seems I need to replace --rdzv_backend=etcd in the python -m torchelastic.distributed.launch command with something else, right? What should that be?

Correct, you replace --rdzv_backend=etcd with --rdzv_backend=c10d. The command will look something like:

python -m torch.distributed.run \
    --nnodes=1:4 \
    --nproc_per_node=$NUM_TRAINERS \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

docs: torch.distributed.run (Elastic Launch) — PyTorch master documentation
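In case it helps, a minimal sketch of what YOUR_TRAINING_SCRIPT.py can look like with this launcher (the script name and the gloo backend are just placeholders; you would typically use nccl on GPUs). torch.distributed.run exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK into the environment, so init_process_group can use env:// initialization:

# YOUR_TRAINING_SCRIPT.py -- minimal illustrative sketch
import os
import torch.distributed as dist

def main():
    # the launcher sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {rank}/{world_size} (local_rank {local_rank}) is up")
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()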

Thanks for the answer.
What should I run at $HOST_NODE_ADDR (as a replacement for etcd)?

If you are doing single-node training you can use localhost; otherwise, pick one of your machines, find its hostname, and use that.
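For example, one quick way to get a value for $HOST_NODE_ADDR is to print the chosen node's resolvable hostname and pair it with a free port (29400 here is just the example port used in the docs, not a requirement):

import socket
# run this on the node you picked as the rendezvous host; use the output
# as --rdzv_endpoint, e.g. myhost.example.com:29400
print(f"{socket.getfqdn()}:29400")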

I am using the Kubernetes controller for my experiments. Is this new backend also supported by this controller? If yes, is there any tutorial I could follow regarding this setup?

You can add the --rdzv_backend=c10d flag to the args when you start your job using the operator.

Hey @aguirguis, I just wrote a tutorial for setting up YOLOv5 using PyTorch and the c10d backend. You can follow this guide:

https://medium.com/p/8a4f07a77cf

You can refer to the YAML I used here: Trainer.Yaml · GitHub

Hi @H-Huang, regarding your comment: “RendezvousClosedError is raised when the whole gang is no longer accepting rendezvous (for example, when the job has finished). Perhaps one of your nodes left the group and is now trying to rejoin after the job finished?”

If a node left and tries to rejoin, but the job hasn’t finished, should it still be accepted? I just saw the RendezvousClosedError when one node tried to rejoin while the job was still running, and this caused the whole job to fail.