Hi, I’m training a variant of Baidu’s DeepSpeech model using the code from this repository, on a dataset of around 1000 hours of audio.
While running the train.py script, I’m hitting this error:
Traceback (most recent call last):
  File "train.py", line 411, in <module>
    main()
  File "train.py", line 219, in main
    for i, (data) in enumerate(train_loader, start=start_iter):
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
I’m using nvidia-docker for my environment, and I’ve set the --ipc=host flag, which is recommended so the container shares the host’s shared memory. Does anyone have an idea why I’m running into this?
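From the traceback, the failure happens while a DataLoader worker process hands a tensor's file descriptor back to the main process. One thing I plan to try is forcing the loader to run single-process, which skips that fd-passing path entirely; here's a minimal sketch of what I mean (the random dataset and batch size are placeholders, not the repo's actual values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the audio features; the real
# loader in train.py wraps spectrograms, not random tensors.
dataset = TensorDataset(torch.randn(8, 4))

# num_workers=0 loads batches in the main process, so the
# multiprocessing resource_sharer / SocketClient machinery from the
# traceback is never exercised.
loader = DataLoader(dataset, batch_size=2, num_workers=0)

batches = [features for (features,) in loader]
```

If training runs cleanly like this (just slower), that would at least confirm the problem is in the worker IPC rather than in the model or data.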
I’ve not faced this issue in the past when I trained on a local installation of PyTorch (without Docker), so I could try running it again outside Docker, but I’d prefer to keep my environment as a Docker image.
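Another workaround I'm considering, since the error comes from the fd-based sharing strategy: PyTorch documents an alternative 'file_system' strategy that shares tensors through files rather than by passing file descriptors over a socket. I'm not certain it's the right fix here, but it would sidestep the SocketClient call in the traceback. Something like this at the top of train.py:

```python
import torch.multiprocessing as mp

# 'file_system' shares storages via files on disk instead of passing
# file descriptors between processes, avoiding the fd-passing path
# that raised ConnectionRefusedError. Note the PyTorch docs warn this
# strategy can leak shared-memory files if processes die abruptly.
mp.set_sharing_strategy('file_system')
```

Has anyone used this successfully inside a container?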