Hi, I’m training a variant of Baidu’s DeepSpeech model using the code from this repository, on a dataset of around 1,000 hours of audio.
While running the train.py script, I’m running into this error:
```
Traceback (most recent call last):
  File "train.py", line 411, in <module>
    main()
  File "train.py", line 219, in main
    for i, (data) in enumerate(train_loader, start=start_iter):
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
```
I’m using nvidia-docker for my environment and I’ve set the `--ipc=host` flag, which is recommended for sharing memory between the container and the host. Does anyone have an idea why I’m running into this?
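For reference, this is roughly how the container is launched; the image name and mount paths are placeholders, not my actual values:

```shell
# Hypothetical launch command: image name and dataset path are placeholders.
# --ipc=host gives the container the host's IPC namespace, so PyTorch
# DataLoader worker processes can use shared memory instead of the
# container's small default /dev/shm.
nvidia-docker run -it --ipc=host \
    -v /path/to/dataset:/data \
    deepspeech-image \
    python train.py
```

An alternative to `--ipc=host` is enlarging the container’s shared memory directly with `--shm-size` (e.g. `--shm-size=8g`), which avoids sharing the host IPC namespace.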
I haven’t faced this issue in the past when training on a local installation of PyTorch (without Docker), so I could try running outside of Docker again, but I’d like to keep my environment as a Docker image.