Data loader crashes during training. Something to do with multiprocessing in docker


(Siddharth Gururani) #1

Hi, I’m training a variant of Baidu’s deepspeech model using the code from this repository. The dataset is around 1000 hours of audio.

While running the train.py script, I’m running into this error:

Traceback (most recent call last):
  File "train.py", line 411, in <module>
    main()
  File "train.py", line 219, in main
    for i, (data) in enumerate(train_loader, start=start_iter):
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

I’m using nvidia-docker for my environment and I’ve set the --ipc=host flag, which is recommended so that the container can use the host’s shared memory. Does anyone have any idea why I’m running into this?
I haven’t faced this issue in the past when training on a local installation of PyTorch (without Docker), so I could try running it outside of Docker again, but I’d like to keep my environment as a Docker image.


(Michal Romaniuk) #2

I’m also having this problem (but with Python 3.6 on Anaconda). It only happens when running on Docker (with --ipc=host).

In my case it seems to be caused by a problem in one of the worker processes:

Process Process-4:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/miniconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 44, in _worker_loop
    data_queue.put((idx, samples))
  File "/miniconda3/lib/python3.6/multiprocessing/queues.py", line 349, in put
    obj = _ForkingPickler.dumps(obj)
  File "/miniconda3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/miniconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 113, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_2143820281> at /opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/TH/THAllocator.c:271

This then seems to cause the error in the main process:

  File "/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/miniconda3/lib/python3.6/multiprocessing/queues.py", line 345, in get
    return _ForkingPickler.loads(res)
  File "/miniconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/miniconda3/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/miniconda3/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
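
A note on how the two tracebacks above may fit together. Judging by the frames (rebuild_storage_fd, resource_sharer, Client), the main process appears to fetch the file descriptor for a shared tensor by connecting to a small listener that lives in the worker that produced the batch. If that worker has already died, nothing is listening and the connect call is refused. The sketch below reproduces only that last step with the standard library. It does not involve PyTorch, it uses a TCP socket on localhost for simplicity, and the function name is made up for illustration.

```python
import errno
from multiprocessing.connection import Client, Listener


def connect_after_listener_gone():
    """Open a listener, close it, then try to connect to its old address.

    Returns the exception raised by the client, or None if the
    connection unexpectedly succeeded.
    """
    # Port 0 asks the OS for any free port on localhost.
    listener = Listener(("127.0.0.1", 0))
    address = listener.address  # the (host, port) actually bound
    listener.close()            # stands in for the worker process dying

    try:
        Client(address).close()
    except ConnectionRefusedError as exc:
        return exc
    return None


exc = connect_after_listener_gone()
print(type(exc).__name__, exc.errno == errno.ECONNREFUSED)
```

If this reading is right, the ConnectionRefusedError in the main process is a symptom, and the worker's own traceback (when one is printed) is the place to look for the cause.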

(Alexandr Kalinin) #3

Did anyone solve this? I have the same problem running model in nvidia-docker.


#4

We haven’t been able to get a reliable repro of this problem, so we can’t investigate and fix it yet.
If anyone can provide a Docker repro of the problem, we can look into this further.


(Elias Vansteenkiste) #5

I also have the same problem.

Traceback (most recent call last):
  File "train.py", line 20, in <module>
    for i, data in enumerate(dataset):
  File "/home/elias/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/elias/.local/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

I don’t use conda or miniconda, maybe that rules something out?


(Elias Vansteenkiste) #6

Everything works fine if I set num_workers to 0.

self.dataloader = torch.utils.data.DataLoader(
            self.dataset,
            batch_size=opt.batchSize,
            shuffle=not opt.serial_batches,
            num_workers=int(opt.nThreads))

obviously


(Furiously Curious) #7

Looks like you are running out of disk space when loading and preprocessing with data loader.


(Elias Vansteenkiste) #8

No, that is not the case. I monitored both RAM usage and disk usage, and more than 50% of each is free.


(Elias Vansteenkiste) #9

I came across a related problem.
If I don’t set the number of threads, the trainer sometimes hangs and I have to kill it with Ctrl-C.

^CTraceback (most recent call last):
  File "train.py", line 20, in <module>
    for i, data in enumerate(dataset):
  File "/home/elias/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
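
The traceback above shows the main process blocked in a get() on the data queue with no timeout, so if no worker ever delivers a batch it waits forever. A minimal sketch of turning such a hang into an error is below. It assumes a plain multiprocessing.Queue standing in for the loader's queue, and get_batch_or_fail is a hypothetical helper, not part of PyTorch.

```python
import multiprocessing as mp
import queue


def get_batch_or_fail(data_queue, timeout_s):
    """Fetch one item, raising instead of blocking forever."""
    try:
        return data_queue.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time: the producer may be dead or stuck.
        raise RuntimeError(
            "no batch received within %.1f s; a worker may have died" % timeout_s
        ) from None


q = mp.Queue()
q.put("batch-0")
print(get_batch_or_fail(q, 5.0))
```

Newer PyTorch releases also accept a timeout argument on DataLoader; check the documentation of the version you are running before relying on it.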

(Zhixiang Wang) #10

I am now facing the same issue.


(Zhixiang Wang) #11

I solved this problem by increasing the size of shared memory by using --shm-size, i.e.,

nvidia-docker run -it --shm-size 8G --name xxxx xxxx

The issue is caused by the shared memory being too small.
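
To see how much shared memory a container actually has before changing --shm-size, something like the following can be run inside it. This is a small sketch that assumes shared memory is mounted at /dev/shm, which is the usual location on Linux. The function name is made up for illustration.

```python
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (total, free) space of the given mount point in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


# Example, to be run inside the container:
#   total, free = shm_usage_gib()
#   print("shm total %.2f GiB, free %.2f GiB" % (total, free))
```

Without --shm-size or --ipc=host, Docker gives a container only 64 MB of /dev/shm by default, which a few loader workers passing large batches can fill quickly.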


(Spandan Madan) #12

I am getting this same error without Docker. So there’s definitely something more to it! Did anyone find a fix?


(Pete Tae-hoon Kim) #13

I have the same issue. Exactly the same.
I ran on Ubuntu 16.04 with two 1080 Ti GPUs; num_workers was 8 and the batch size was 96.

  File "train_cs.py", line 135, in <module>
    for i, data in enumerate(data_loader):
  File "/home/sipark/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 204, in __next__
    idx, batch = self.data_queue.get()
  File "/home/sipark/anaconda3/lib/python3.6/multiprocessing/queues.py", line 345, in get
    return _ForkingPickler.loads(res)
  File "/home/sipark/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/sipark/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/sipark/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/sipark/anaconda3/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/sipark/anaconda3/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

#14

Same here with Python 2.7 and PyTorch 0.2. Is anyone also seeing this on PyTorch 0.3, or should I upgrade to try to solve it?

    batch = data_iter.next()
  File "/home/tals/env1/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 195, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/home/tals/env1/local/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/home/tals/env1/local/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

(shir gur) #15

Same for me :confused:


#16

Increasing the shared memory size solved this for me. Shared memory is used to pass data between processes.
Try tracking your RAM usage during training to make sure there is no memory leak.
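
One rough way to follow the suggestion about tracking RAM is to log the peak resident memory of the training process every few iterations. The sketch below uses only the standard library and assumes Linux, where ru_maxrss is reported in kilobytes. The helper names and the list of callables standing in for training steps are made up for illustration.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Linux: ru_maxrss is in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def track_peak_rss(steps, every=1):
    """Run each callable in `steps` and record (index, peak MiB) every `every` steps."""
    readings = []
    for i, step in enumerate(steps):
        step()
        if i % every == 0:
            readings.append((i, peak_rss_mib()))
    return readings


# Stand-in "training steps" that just allocate some memory.
for index, mib in track_peak_rss([lambda: bytearray(10 ** 6)] * 3):
    print(index, round(mib, 1))
```

This only covers the process that calls it, so memory held by loader workers will not show up. A peak that keeps climbing over many iterations is still a hint that something is leaking in the main process.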


(Vincentgu) #17

I solved this problem by changing num_workers from 4 to 8, but I don’t know why that helped.


#18

I ran into the same problem with CUDA 9.0 and PyTorch 0.3.1.
I tried changing num_workers, but it didn’t help.
Oddly enough, the problem went away after I doubled the batch_size.