DataLoader gets stuck whenever training starts

Previously my training was working perfectly fine and I trained the model up to epoch 27, but now, when I resume training from epoch 28, training freezes because the DataLoader gets stuck. I tried num_workers=4 and also num_workers=0; initially num_workers=4 was working fine. I also tried rebooting my PC, but the problem remains. I manually stopped the training when it froze, and here is the traceback from that point:

```
Traceback (most recent call last):
  File "train_speech_embedder.py", line 225, in <module>
    train(hp.model.model_path)
  File "train_speech_embedder.py", line 80, in train
    for batch_id,X in enumerate(train_loader):
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
```
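
Since the freeze reportedly persists even with num_workers=0, the hang most likely happens inside the dataset's `__getitem__` itself. A quick diagnostic (a sketch, not part of the original script; the `dataset` argument stands in for whatever Dataset `train_speech_embedder.py` builds) is to load every sample in the main process and watch where it stalls:

```python
from torch.utils.data import Dataset

def find_hanging_sample(dataset: Dataset) -> None:
    """Load each sample in the main process; if the loop freezes,
    the last printed index is the sample that hangs on load."""
    for i in range(len(dataset)):
        print(f"loading sample {i}", flush=True)
        _ = dataset[i]  # a hang here points at this exact sample
    print("all samples loaded cleanly")
```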

Is the problem only raised if you try to continue the training at epoch 28, or also if you restart the complete training?

Whenever I try to restart the training, the same issue remains.

It sounds like a system issue if nothing suddenly works anymore.
Did you change any drivers, or are you running out of disk space/memory?
Could you restart the machine and, if that doesn’t help, use a Docker container as a quick check?

The issue is resolved. It was related to neither the DataLoader nor multiprocessing. There was a file that made the DataLoader hang whenever it tried to load it, so I removed that file from the training data.
Thanks for your help!
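
For reference, one way to drop such a sample without touching the data on disk is to wrap the dataset in a `Subset` that skips the offending index (a sketch; `bad_indices` is hypothetical and would hold whatever index the diagnostic above revealed, and `train_dataset` stands in for the actual Dataset):

```python
from torch.utils.data import Subset

bad_indices = {1234}  # hypothetical: index of the file that hangs on load
keep = [i for i in range(len(train_dataset)) if i not in bad_indices]
clean_dataset = Subset(train_dataset, keep)  # train on this instead
```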

@ptrblck I am running into this issue very frequently.
My torch version is 1.11.
This is a widely reported issue and I’m not sure what is to blame. Is there any workaround?

```
/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    988         #   (bool: whether successfully get data, any: data if successful else None)
    989         try:
--> 990             data = self._data_queue.get(timeout=timeout)
    991             return (True, data)
    992         except Exception as e:

/opt/conda/lib/python3.8/multiprocessing/queues.py in get(self, block, timeout)
    105                 if block:
    106                     timeout = deadline - time.monotonic()
--> 107                     if not self._poll(timeout):
    108                         raise Empty
    109                 elif not self._poll():

/opt/conda/lib/python3.8/multiprocessing/connection.py in poll(self, timeout)
    255         self._check_closed()
    256         self._check_readable()
--> 257         return self._poll(timeout)
    258 
    259     def __enter__(self):

/opt/conda/lib/python3.8/multiprocessing/connection.py in _poll(self, timeout)
    422 
    423     def _poll(self, timeout):
--> 424         r = wait([self], timeout)
    425         return bool(r)
    426 

/opt/conda/lib/python3.8/multiprocessing/connection.py in wait(object_list, timeout)
    929 
    930             while True:
--> 931                 ready = selector.select(timeout)
    932                 if ready:
    933                     return [key.fileobj for (key, events) in ready]

/opt/conda/lib/python3.8/selectors.py in select(self, timeout)
    413         ready = []
    414         try:
--> 415             fd_event_list = self._selector.poll(timeout)
    416         except InterruptedError:
    417             return ready

KeyboardInterrupt:
```
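
One defensive mitigation, not a fix for the root cause (a sketch; `train_dataset` and the batch size are placeholders for the actual setup): DataLoader accepts a `timeout` argument in seconds, which makes a stuck worker raise a RuntimeError instead of blocking forever, so the failing batch can at least be identified and logged.

```python
from torch.utils.data import DataLoader

# `train_dataset` is a placeholder for the actual Dataset in use.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,   # assumed value
    num_workers=4,
    timeout=120,     # seconds; raise instead of hanging indefinitely
)

try:
    for batch_id, X in enumerate(train_loader):
        ...  # training step
except RuntimeError as e:
    # With timeout set, a stuck worker surfaces here instead of freezing.
    print(f"DataLoader timed out: {e}")
```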