Python=3.11.4 | packaged by conda-forge
pytorch=2.0.1
pytorch-cuda=11.7
I use DistributedDataParallel for training and testing a model and am getting the following error:
Traceback (most recent call last):
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/local/scratch/ds858/nlsa/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
metadata = storage._share_filename_cpu_()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A similar error has been reported here: Unable to write to file </torch_18692_1954506624>
I am, however, not using Docker or any similar containerization. Shared memory should also be more than sufficient; I have checked it repeatedly.
The DataLoader's num_workers is also 0, since it is left at its default value. (There are, however, multiple processes running on the GPUs at the same time.)
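To be explicit about that: I don't pass num_workers anywhere, so it takes its default. A minimal illustration (the dataset here is just a stand-in, not my actual data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; my real dataset is different
ds = TensorDataset(torch.zeros(4, 2))

# num_workers is not specified, so it defaults to 0
loader = DataLoader(ds)
print(loader.num_workers)  # 0
```

So no worker processes should be spawned by the loader itself.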
Glancing at reductions.py, I found that the failing line above is specific to the file_system sharing strategy. As a test, I switched to the file_descriptor strategy, which led to the different error you can find below.
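For reference, the switch itself was just the following, via torch.multiprocessing's public API (this snippet only shows the strategy change, not the rest of my setup):

```python
import torch.multiprocessing as mp

# List the strategies this platform supports;
# on Linux this is typically {'file_system', 'file_descriptor'}
print(mp.get_all_sharing_strategies())

# Switch to file_descriptor for the test
mp.set_sharing_strategy('file_descriptor')
print(mp.get_sharing_strategy())  # 'file_descriptor'
```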
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
fd = df.detach()
^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 507, in Client
answer_challenge(c, authkey)
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 756, in answer_challenge
response = connection.recv_bytes(256) # reject large message
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 215, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
raise EOFError
EOFError
I would be grateful for any help or advice on addressing this issue.