Previously my training was working perfectly fine and trained the model through 27 epochs, but now when I resume training from the 28th epoch, training freezes because the DataLoader gets stuck. I tried num_workers=4 and also num_workers=0. Initially num_workers=4 was working fine. I also tried rebooting my PC, but the problem remains. I manually stopped the training when it froze, and here is the traceback from when I stopped it:
Traceback (most recent call last):
File "train_speech_embedder.py", line 225, in <module>
train(hp.model.model_path)
File "train_speech_embedder.py", line 80, in train
for batch_id,X in enumerate(train_loader):
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/multiprocessing/connection.py", line 920, in wait
ready = selector.select(timeout)
File "/home/pickledev/anaconda3/envs/torch_gpu/lib/python3.7/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
It sounds like a system issue if everything suddenly stops working.
Did you change any drivers, or are you running out of disk space or memory?
Could you restart the machine, and if that doesn't help, use a Docker container as a quick check?
The issue is resolved. It was related to neither the DataLoader nor multiprocessing. There was one file that made the DataLoader hang whenever it tried to load it, so I removed that file from the training set.
Thanks for your help
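For anyone hitting the same symptom: one way to locate such a file is to scan the dataset item by item with a per-item timeout, instead of waiting for the DataLoader to hang silently. Below is a minimal sketch; the `find_hanging_items` helper and the `Toy` dataset are hypothetical illustrations (not from this thread), and the SIGALRM-based timeout only works on Unix in the main thread:

```python
import signal
import time

class _Timeout(Exception):
    """Raised by the SIGALRM handler when an item takes too long to load."""

def _raise_timeout(signum, frame):
    raise _Timeout

def find_hanging_items(dataset, timeout_s=5):
    """Return indices of items that take longer than timeout_s seconds to load.

    Works with any map-style dataset (anything supporting len() and
    integer indexing). Unix-only, main thread only, since it uses SIGALRM.
    """
    bad = []
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    try:
        for i in range(len(dataset)):
            signal.alarm(timeout_s)          # arm the per-item timeout
            try:
                _ = dataset[i]               # attempt the load
            except _Timeout:
                bad.append(i)                # this item hung
            finally:
                signal.alarm(0)              # cancel any pending alarm
    finally:
        signal.signal(signal.SIGALRM, old_handler)
    return bad

# Toy stand-in for a dataset where one item hangs on load:
class Toy:
    def __len__(self):
        return 4
    def __getitem__(self, i):
        if i == 2:
            time.sleep(10)  # simulates the stuck file
        return i

print(find_hanging_items(Toy(), timeout_s=1))  # → [2]
```

Once the offending indices are known, the corresponding file paths can be looked up in the dataset's file list and removed or repaired.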