OSError while running Fairseq BART example

Hi folks,

I was trying to re-run the CNN/DM fine-tuning example by following the instructions in the repo: https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md#4-fine-tuning-on-cnn-dm-summarization-task

When I run the command (point #4 in the link), it starts the training loop and prints the progress bar, but then fails with an OSError.

epoch 001:   0%|          | 0/29399 [00:00<?, ?it/s]
2020-08-21 05:36:29 | INFO | fairseq.trainer | begin training epoch 1
Traceback (most recent call last):
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/reduction.py", line 191, in DupFd
    return resource_sharer.DupFd(fd)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/resource_sharer.py", line 53, in __init__
    self._id = _resource_sharer.register(send, close)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/resource_sharer.py", line 77, in register
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/resource_sharer.py", line 130, in _start
    self._listener = Listener(authkey=process.current_process().authkey)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/connection.py", line 438, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/projects/anaconda3/envs/py36-fairseq/lib/python3.6/multiprocessing/connection.py", line 576, in __init__
OSError: AF_UNIX path too long

I couldn’t find anything related to this online, other than that Python raises this error when the path of a UNIX domain socket exceeds the length limit imposed by the OS.
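
For reference, the same error can be triggered outside fairseq. The snippet below is only a minimal sketch of the limit itself, not of what fairseq or PyTorch does internally: it tries to bind an AF_UNIX socket to a deliberately long path. The helper name `can_bind_unix_socket` and the 200-character file name are made up for illustration; the actual limit is around 100 bytes and depends on the platform (108 on Linux).

```python
import os
import socket
import tempfile


def can_bind_unix_socket(path):
    """Return True if an AF_UNIX socket can be bound at `path`, else False."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        return True
    except OSError as exc:
        # For an over-long path this is "AF_UNIX path too long".
        print(f"bind failed: {exc}")
        return False
    finally:
        sock.close()
        # Remove the socket file if bind() managed to create it.
        if os.path.exists(path):
            os.unlink(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name, chosen only to exceed the limit.
        long_path = os.path.join(tmp_dir, "x" * 200)
        print(len(long_path), can_bind_unix_socket(long_path))
```

In the traceback above, the path is not chosen by user code: it is the address that multiprocessing generates for its Listener, so the length of the temporary directory path is what matters.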

Was wondering if anyone has encountered something similar.


Are you getting this error only when using the fairseq repository or also if you are using plain PyTorch code?
In the former case, I would recommend creating an issue in the fairseq repository. In the latter case, could you post an executable code snippet so that we can reproduce the issue?

Thanks for replying, @ptrblck.
There were some other issues with my hardware – it was running out of disk space.
Solving that also solved this issue.
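
A possible explanation for the link between the two, which is an assumption and was not verified in this thread: multiprocessing creates its listener socket under a temporary directory picked via `tempfile`, and when the usual temp directory cannot be written to (for example because the disk is full), `tempfile` may fall back to another directory such as the current working directory, whose path can be much longer. A quick diagnostic sketch (the helper name `temp_dir_report` is hypothetical) to check which directory is in use and how much space is left there:

```python
import os
import shutil
import tempfile


def temp_dir_report():
    """Collect the temp directory Python selected and the free space on its disk."""
    tmp_dir = tempfile.gettempdir()
    usage = shutil.disk_usage(tmp_dir)
    return {
        "tmp_dir": tmp_dir,
        # Socket paths are created below this directory, so a long
        # prefix leaves less room before the AF_UNIX limit is hit.
        "path_length": len(os.fsencode(tmp_dir)),
        "free_bytes": usage.free,
    }


if __name__ == "__main__":
    for key, value in temp_dir_report().items():
        print(f"{key}: {value}")
```

If `tmp_dir` turns out to be something other than the expected location (such as `/tmp`), or `free_bytes` is close to zero, freeing disk space or pointing the `TMPDIR` environment variable at a short, writable path may be worth trying.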