How to debug ConnectionRefusedError in Data Loader?

I’m trying to debug a “ConnectionRefusedError: [Errno 111]” error:

(RayTrainWorker pid=3779494) This exception is thrown by __iter__ of IterableWrapperIterDataPipe(deepcopy=True, iterable=<ray.train.torch.train_loop_utils._WrappedDataLoader object at 0x7f07bf4bf1c0>)
(RayTrainWorker pid=3779494) Stack trace: Traceback (most recent call last):
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1126, in _try_get_data
(RayTrainWorker pid=3779494)     data = self._data_queue.get(timeout=timeout)
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/multiprocessing/queues.py", line 122, in get
(RayTrainWorker pid=3779494)     return _ForkingPickler.loads(res)
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
(RayTrainWorker pid=3779494)     fd = df.detach()
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
(RayTrainWorker pid=3779494)     with _resource_sharer.get_connection(self._id) as conn:
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
(RayTrainWorker pid=3779494)     c = Client(address, authkey=process.current_process().authkey)
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/multiprocessing/connection.py", line 507, in Client
(RayTrainWorker pid=3779494)     c = SocketClient(address)
(RayTrainWorker pid=3779494)   File "/home/ray/anaconda3/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
(RayTrainWorker pid=3779494)     s.connect(address)
(RayTrainWorker pid=3779494) ConnectionRefusedError: [Errno 111] Connection refused

Normally, this error seems to happen when something else has gone wrong in the data processing methods, but looking through my logs, I don’t see any other exception.

The only suspicious thing is that, almost always, a few batches (~10) before the error, I get the following message:

(RayTrainWorker pid=3779497) /home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/datapipes/utils/common.py:295: FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
(RayTrainWorker pid=3779497) See https://github.com/pytorch/data/issues/163 for details.
(RayTrainWorker pid=3779497) Please use `.open_files_by_fsspec()` instead.
(RayTrainWorker pid=3779497)   warnings.warn(msg, FutureWarning)
(RayTrainWorker pid=3779497) /home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/datapipes/utils/common.py:295: FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
(RayTrainWorker pid=3779497) See https://github.com/pytorch/data/issues/163 for details.
(RayTrainWorker pid=3779497) Please use `.open_files_by_iopath()` instead.
(RayTrainWorker pid=3779497)   warnings.warn(msg, FutureWarning)

This is surprising because:

  1. open_file_by_iopath is not used anywhere in my code.
  2. This is happening in the middle of the training loop, which is suspicious, since no new datapipes should be constructed in the middle of an epoch.

Any thoughts on what could be going on?

Tagging @ejguan and @nivek, since this is a torchdata question.

The open_file_by_iopath message should not affect your program, as it’s just a warning.

Could you please try to use IterableWrapper(ray_dl, deepcopy=False)?
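To make the suggestion concrete: as far as I understand it, the deepcopy flag controls whether the wrapper iterates over a deep copy of the wrapped object or over the object itself. The snippet below is only a stdlib sketch of that behaviour. WrapperSketch and CountingLoader are hypothetical names made up for illustration; this is not torchdata's actual implementation and it does not reproduce the ConnectionRefusedError.

```python
import copy


class WrapperSketch:
    """Simplified stand-in for IterableWrapper (hypothetical, for illustration only)."""

    def __init__(self, iterable, deepcopy=True):
        self.iterable = iterable
        self.deepcopy = deepcopy

    def __iter__(self):
        # With deepcopy=True every pass iterates over a fresh copy of the
        # wrapped object; with deepcopy=False the original object is used.
        source = copy.deepcopy(self.iterable) if self.deepcopy else self.iterable
        yield from source


class CountingLoader:
    """Hypothetical loader that counts how often it is iterated directly."""

    def __init__(self, batches):
        self.batches = batches
        self.iter_calls = 0

    def __iter__(self):
        self.iter_calls += 1
        return iter(self.batches)


loader_a = CountingLoader([1, 2, 3])
batches_a = list(WrapperSketch(loader_a, deepcopy=True))
# The copy was iterated, so the original loader was never touched.

loader_b = CountingLoader([1, 2, 3])
batches_b = list(WrapperSketch(loader_b, deepcopy=False))
# The original loader itself was iterated.

print(loader_a.iter_calls, loader_b.iter_calls)
```

In the real pipeline the wrapped object is the Ray-wrapped DataLoader from the traceback, so deepcopy=False would mean the existing loader is iterated directly instead of a copy of it.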

BTW, which version of torchdata are you using?

pip list | grep torchdata

returns

 0.6.0.dev20221027
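If it helps to confirm the version from inside the training worker rather than from the shell (the two environments can differ on a cluster), a small stdlib sketch along these lines should work. installed_version is a hypothetical helper name, not part of torchdata or Ray.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the torchdata version visible to this interpreter, or None if not installed.
print(installed_version("torchdata"))
```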

Thanks! This seemed to fix the issue (not sure why).