During training, I got the following error (it probably occurred at the boundary between two epochs):
Traceback (most recent call last):
  File "anaconda3/envs/SORTIP/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 417, in reduce_storage
    metadata = storage._share_filename_cpu_()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda3/envs/SORTIP/lib/python3.11/site-packages/torch/storage.py", line 334, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Shared memory manager connection has timed out
The error occurs randomly, so I cannot pinpoint its cause in my code or reproduce it reliably.
Does anyone have any idea what might be causing this? Thanks a lot!