Hello,
In some cases, after scaling the job up and down, I get the following error:
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "elastic-imagenet-wxmlb", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "elastic-imagenet-wxmlb-worker-7", "state": "FAILED", "total_run_time": 0, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n result = agent.run()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n result = self._invoke_run(role)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 152, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 284, in rendezvous_barrier\n return self.init_phase()\n 
File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 346, in init_phase\n raise RendezvousClosedError()\ntorch.distributed.elastic.rendezvous.api.RendezvousClosedError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}
Neither the logs nor the error message makes clear what closed the rendezvous. This error causes the whole job to fail, and without understanding what is going on I cannot fix it.
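In case it helps anyone reading the event, here is a minimal sketch that decodes the JSON and prints the escaped `raw_error` traceback in readable form. The `event_line` string is an abbreviated copy of the event above (only a few fields and the last traceback frame are kept), and `summarize_event` is a hypothetical helper name, not part of torchelastic.

```python
import json

# Abbreviated copy of the event above: only a few metadata fields and the
# last traceback frame are kept. A raw string preserves the JSON escapes.
event_line = r'''{"name": "torchelastic.worker.status.FAILED", "metadata": {"run_id": "elastic-imagenet-wxmlb", "hostname": "elastic-imagenet-wxmlb-worker-7", "state": "FAILED", "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/job/.local/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/etcd_rendezvous.py\", line 346, in init_phase\n raise RendezvousClosedError()\ntorch.distributed.elastic.rendezvous.api.RendezvousClosedError\n"}}'''


def summarize_event(line):
    """Hypothetical helper: pull the readable parts out of one event line."""
    event = json.loads(line)
    meta = event["metadata"]
    # json.loads has already turned the escaped \n and \" into real characters.
    traceback_text = meta.get("raw_error", "")
    tb_lines = traceback_text.strip().splitlines()
    return {
        "hostname": meta.get("hostname"),
        "state": meta.get("state"),
        # The last traceback line names the exception that was raised.
        "exception": tb_lines[-1] if tb_lines else "",
        "traceback": traceback_text,
    }


summary = summarize_event(event_line)
print(summary["hostname"], summary["state"], summary["exception"])
print(summary["traceback"])
```

This only makes the event easier to read; it shows that the agent on `elastic-imagenet-wxmlb-worker-7` raised `RendezvousClosedError` in `init_phase`, but not which node closed the rendezvous.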
Any help?
Thank you very much