RuntimeError: Shared memory manager connection has timed out

During training I hit the following error (it seemed to appear at the boundary between two epochs):

Traceback (most recent call last):
  File "anaconda3/envs/SORTIP/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 417, in reduce_storage
    metadata = storage._share_filename_cpu_()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/storage.py", line 334, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Shared memory manager connection has timed out

The error occurs at random, so I cannot pinpoint it in my code or even reproduce it reliably.

Does anyone have an idea what might cause this? Thanks a lot~


I still don't know the root cause of this error, but I suspect it is related to the DataLoader's worker multiprocessing between adjacent epochs (perhaps the worker startup or shutdown). Re-reading the documentation, I found that persistent_workers=True keeps the worker processes alive between epochs. If my assumption is correct, this option should sidestep the error I hit.

Now I use persistent_workers=True in my DataLoader, and I haven't encountered the problem again so far. But I still don't know the underlying cause, and I'm not sure it won't come back. If anyone can shed some light on this problem, it would be greatly appreciated.
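For reference, a minimal sketch of the setup (the dataset and hyperparameters below are placeholders, not my actual training code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own Dataset here.
dataset = TensorDataset(
    torch.randn(1000, 3, 32, 32),
    torch.randint(0, 10, (1000,)),
)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # requires num_workers > 0 for persistent_workers
    persistent_workers=True,  # keep workers alive across epochs instead of
                              # tearing them down and respawning each epoch
)

for epoch in range(10):
    for inputs, targets in loader:
        ...  # training step

Note that persistent_workers only takes effect when num_workers > 0; with the default num_workers=0 there are no worker processes to persist.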