I’m trying to debug a “ConnectionRefusedError: [Errno 111]” error:
(RayTrainWorker pid=3779494) This exception is thrown by __iter__ of IterableWrapperIterDataPipe(deepcopy=True, iterable=<ray.train.torch.train_loop_utils._WrappedDataLoader object at 0x7f07bf4bf1c0>)
(RayTrainWorker pid=3779494) Stack trace: Traceback (most recent call last):
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1126, in _try_get_data
(RayTrainWorker pid=3779494) data = self._data_queue.get(timeout=timeout)
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/multiprocessing/queues.py", line 122, in get
(RayTrainWorker pid=3779494) return _ForkingPickler.loads(res)
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
(RayTrainWorker pid=3779494) fd = df.detach()
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
(RayTrainWorker pid=3779494) with _resource_sharer.get_connection(self._id) as conn:
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
(RayTrainWorker pid=3779494) c = Client(address, authkey=process.current_process().authkey)
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/multiprocessing/connection.py", line 507, in Client
(RayTrainWorker pid=3779494) c = SocketClient(address)
(RayTrainWorker pid=3779494) File "/home/ray/anaconda3/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
(RayTrainWorker pid=3779494) s.connect(address)
(RayTrainWorker pid=3779494) ConnectionRefusedError: [Errno 111] Connection refused
Normally, this error seems to happen when something else has gone wrong in the data processing methods, but looking through my logs, I don’t see any other exception.
The only fishy thing is that, almost always, a few batches (~10) before the error, I get the following message:
(RayTrainWorker pid=3779497) /home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/datapipes/utils/common.py:295: FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
(RayTrainWorker pid=3779497) See https://github.com/pytorch/data/issues/163 for details.
(RayTrainWorker pid=3779497) Please use `.open_files_by_fsspec()` instead.
(RayTrainWorker pid=3779497) warnings.warn(msg, FutureWarning)
(RayTrainWorker pid=3779497) /home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/datapipes/utils/common.py:295: FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
(RayTrainWorker pid=3779497) See https://github.com/pytorch/data/issues/163 for details.
(RayTrainWorker pid=3779497) Please use `.open_files_by_iopath()` instead.
(RayTrainWorker pid=3779497) warnings.warn(msg, FutureWarning)
This is kind of surprising because:
- open_file_by_iopath is not used anywhere in my code.
- This is happening in the middle of the training loop, which is suspicious, since no new datapipes should be constructed in the middle of an epoch.
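To narrow down where these FutureWarnings come from, one option might be to record a stack trace every time one is shown. The following is only a minimal sketch using the standard-library warnings and traceback modules; the helper name install_warning_tracer is hypothetical and not part of Ray or PyTorch. It assumes the hook gets installed in whichever process actually emits the warning (possibly a DataLoader worker, e.g. through a worker_init_fn), which I have not confirmed.

```python
import traceback
import warnings


def install_warning_tracer(category=FutureWarning, sink=None):
    """Record the call stack each time a warning of `category` is shown.

    Hypothetical debugging helper. Returns the list that collects one
    formatted stack trace (a string) per matching warning.
    """
    traces = sink if sink is not None else []
    original_showwarning = warnings.showwarning

    def showwarning(message, cat, filename, lineno, file=None, line=None):
        # Capture the stack that led to the warnings.warn(...) call.
        if issubclass(cat, category):
            traces.append("".join(traceback.format_stack()))
        # Still print the warning the usual way.
        original_showwarning(message, cat, filename, lineno, file, line)

    warnings.showwarning = showwarning
    # Show every occurrence instead of only the first one per location.
    warnings.simplefilter("always", category)
    return traces
```

Calling traces = install_warning_tracer() before the training loop starts and printing traces[-1] after the warning appears should show which call path triggers the deprecated functional API mid-epoch.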
Any thoughts on what could be going on?