What closes Rendezvous in torch elastic?

Hello,

In some weird cases (with scaling up and down), I get the following error:
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "elastic-imagenet-wxmlb", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "elastic-imagenet-wxmlb-worker-7", "state": "FAILED", "total_run_time": 0, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n result = agent.run()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n result = self._invoke_run(role)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 152, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 284, in rendezvous_barrier\n return self.init_phase()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 346, in init_phase\n raise RendezvousClosedError()\ntorch.distributed.elastic.rendezvous.api.RendezvousClosedError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}
Neither the logs nor the error message make it clear what closed the Rendezvous backend. This error forces the whole task to fail, and without understanding what is going on I cannot fix the issue.
Any help?

Thank you very much

How are you scaling up and scaling down? RendezvousClosedError is raised when the whole gang is no longer accepting rendezvous (for example, when the job has finished). Perhaps one of your nodes left the group and is now trying to rejoin after the job finished? It would be useful to have an example that reproduces what you are seeing.
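For context, here is a minimal sketch of how that error surfaces (illustrative only, not your exact setup): once some participant marks the rendezvous closed via the handler's set_closed(), any node that later calls next_rendezvous() gets RendezvousClosedError instead of being admitted to a new round. The handler argument below is assumed to be whatever rendezvous handler your launcher constructed; the tuple return matches the call shown in your traceback.

# sketch: where RendezvousClosedError comes from, assuming `handler` is an
# already-constructed torch.distributed.elastic RendezvousHandler
from torch.distributed.elastic.rendezvous.api import (
    RendezvousClosedError,
    RendezvousHandler,
)

def try_join(handler: RendezvousHandler):
    try:
        # blocks until a new rendezvous round completes
        store, rank, world_size = handler.next_rendezvous()
        return store, rank, world_size
    except RendezvousClosedError:
        # the rendezvous was closed (e.g. the job finished and a participant
        # called handler.set_closed()), so late joiners are rejected rather
        # than admitted to a new round
        return None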

cc @cbalioglu

Actually I realized now that this happens even without scaling. The output of the other node is:
INFO 2021-08-03 11:20:51,599 Rendezvous timeout occured in EtcdRendezvousHandler {"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "elastic-imagenet-dzq2v", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "elastic-imagenet-dzq2v-worker-0", "state": "FAILED", "total_run_time": 901, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n result = agent.run()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n result = self._invoke_run(role)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 152, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 284, in rendezvous_barrier\n return self.init_phase()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n return self.join_phase(state[\"version\"])\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 390, in join_phase\n active_version = self.wait_for_peers(expected_version)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 563, in wait_for_peers\n active_version, state = self.try_wait_for_state_change(\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 859, in try_wait_for_state_change\n raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Could it be that this node closed the Rendezvous?

Can you try using the new rendezvous backend (Rendezvous — PyTorch master documentation) instead of the etcd rendezvous and see if you still run into the same error?

Would you please explain how I can do this?
It seems I need to replace --rdzv_backend=etcd in the python -m torchelastic.distributed.launch command with something else, right? What should that be?

Correct, you replace --rdzv_backend=etcd with --rdzv_backend=c10d. The command will look something like:

python -m torch.distributed.run \
    --nnodes=1:4 \
    --nproc_per_node=$NUM_TRAINERS \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

docs: torch.distributed.run (Elastic Launch) — PyTorch master documentation
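In case it helps, a minimal sketch of what YOUR_TRAINING_SCRIPT.py can look like with this launcher (the script name and the gloo backend are just placeholders; you would typically use nccl on GPUs). torch.distributed.run exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK into the environment, so init_process_group can use env:// initialization:

# YOUR_TRAINING_SCRIPT.py -- minimal illustrative sketch
import os
import torch.distributed as dist

def main():
    # the launcher sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {rank}/{world_size} (local_rank {local_rank}) is up")
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()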

Thanks for the answer.
What should I run at $HOST_NODE_ADDR (as a replacement for etcd)?

If you are doing single-node training you can use localhost; otherwise, pick one of your machines, find its hostname, and use that.
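For example, one quick way to get a value for $HOST_NODE_ADDR is to print the chosen node's resolvable hostname and pair it with a free port (29400 here is just the example port used in the docs, not a requirement):

import socket
# run this on the node you picked as the rendezvous host; use the output
# as --rdzv_endpoint, e.g. myhost.example.com:29400
print(f"{socket.getfqdn()}:29400")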

I am using the Kubernetes controller for my experiments. Is this new backend also supported by this controller? If yes, is there any tutorial I could follow regarding this setup?

You can add the --rdzv_backend=c10d flag to the args when you start your job using the operator.

Hey @aguirguis, I just wrote a tutorial for setting up YOLOv5 using PyTorch and the c10d backend. You can follow this guide:

https://medium.com/p/8a4f07a77cf

You can refer to the YAML I used here: Trainer.Yaml · GitHub

Hi @H-Huang, regarding your comment: “RendezvousClosedError is raised when the whole gang is no longer accepting rendezvous (for example, when the job has finished). Perhaps one of your nodes left the group and is now trying to rejoin after the job finished?”

If a node left and tries to rejoin, but the job hasn’t finished, should it still be accepted? I just saw the RendezvousClosedError when one node tried to rejoin while the job was still running, and this caused the whole job to fail.