Thanks, Howard.
I compiled and installed from source and am now trying out the elastic_launch API.
One question: is there a definitive list of everything that needs to be in the environment for elastic_launch? When using the script, there are a bunch of things such as RANK and MASTER_ADDR that get set up to make everything ready. However, elastic_launch's config only has a subset of these variables.
I haven't fully traced the code yet; I would just like to double-check first instead of finding them by trial and error.
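In the meantime, a quick way to see which of these variables are actually visible inside a worker is something like the sketch below. The variable list is an assumption on my side, based on the names documented for torchrun, and is not confirmed to be the full set elastic_launch relies on. `EXPECTED_VARS` and `missing_env_vars` are hypothetical names used only for illustration.

```python
import os

# Assumed list, based on the environment variables documented for torchrun.
# Not confirmed to be everything elastic_launch needs.
EXPECTED_VARS = [
    "RANK",
    "LOCAL_RANK",
    "WORLD_SIZE",
    "LOCAL_WORLD_SIZE",
    "GROUP_RANK",
    "MASTER_ADDR",
    "MASTER_PORT",
]


def missing_env_vars(expected=EXPECTED_VARS, environ=None):
    """Return the names in `expected` that are absent or empty in `environ`.

    `environ` defaults to os.environ; pass a dict to check something else.
    """
    env = os.environ if environ is None else environ
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    # Print whatever is missing in the current process.
    print("missing:", missing_env_vars())
```

Calling `missing_env_vars()` at the top of the training function would show what each worker process really received.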
Right now I am able to get things kicked off, but apparently some configuration is missing:
(pid=601762) "message": "EtcdException: Could not get the list of servers, maybe you provided the wrong host(s) to connect to?",
(pid=601762) "extraInfo": {
(pid=601762) "py_callstack": "Traceback (most recent call last):\n File \"/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connection.py\", line 170, in _new_conn\n (self._dns_host, self.port), self.timeout, **extra_kw\n File \"/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/util/connection.py\", line 96, in create_connection\n
and
Traceback (most recent call last):
File "train_ray_local.py", line 169, in <module>
ray.get([client.train.remote(), client2.train.remote()])
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/worker.py", line 1481, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(EtcdException): ray::Network.train() (pid=601762, ip=10.231.13.71)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
ray::Network.train() (pid=601762, ip=10.231.13.71)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/home/centos/anaconda3/envs/dev/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f6a609edb50>: Failed to establish a new connection: [Errno 111] Connection refused
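Since the last error is a plain "Connection refused", one thing that can be checked before launching is whether anything is listening at the rendezvous endpoint at all. Below is a minimal sketch using only the standard library. `endpoint_reachable` is a hypothetical helper, and the host and port in the usage example are placeholders (2379 is the default etcd client port), not the values from my config.

```python
import socket


def endpoint_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This only tests that something accepts connections on that port;
    it does not verify that the listener is actually an etcd server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the rendezvous endpoint being used.
    print(endpoint_reachable("localhost", 2379))
```

If this returns False from the machine running the Ray actor, the EtcdException above would be expected regardless of which environment variables are set.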