DataLoader stops working? Deadlock?

When I use PyTorch to fine-tune ResNet, training runs well at the beginning, but it stops making progress after several epochs.
I checked nvidia-smi: about half of the GPU memory is occupied, but the GPU is idle, while CPU usage is almost 100%. It seems that the GPU is waiting for data from the DataLoader, which is preprocessed on the CPU. I interrupted with CTRL-C and it printed the output below. Can anyone tell me what happened and how to solve this problem? Should I use a smaller batch_size or fewer num_workers? Any advice would be appreciated. Thanks :)

Output after I interrupted with CTRL-C:

epoch: 17 lr: 0.05 loss: 5.825265407562256 acc_rate: 0.65875 total_num: 1600
^CProcess Process-70:
Process Process-72:
Traceback (most recent call last):
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gitoo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 50, in _worker_loop
    r = index_queue.get()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

KeyboardInterrupt
Traceback (most recent call last):
  File "train.py", line 143, in <module>
Process Process-69:
Traceback (most recent call last):
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gitoo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 50, in _worker_loop
    r = index_queue.get()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
    main(args)
  File "train.py", line 70, in main
    for i_batch, sample_batched in enumerate(dataloader):
  File "/home/gitoo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 275, in __next__
    idx, batch = self._get_batch()
  File "/home/gitoo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 254, in _get_batch
    return self.data_queue.get()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/queues.py", line 342, in get
    res = self._reader.recv_bytes()
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/gitoo/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 29, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/gitoo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 175, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 15761) exited unexpectedly with exit code 1.
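
In the output above, the main process is blocked in `self.data_queue.get()` while the workers wait in `index_queue.get()`, and CTRL-C interleaves the tracebacks of all processes. One way to capture where the main process is stuck without interrupting it might be a watchdog built on the standard-library `faulthandler` module. The following is only a sketch: `run_with_watchdog` and `slow_step` are hypothetical names, and `slow_step` merely stands in for a stalled `for batch in dataloader` loop. Note that `faulthandler` dumps the threads of the calling process only, not those of the worker processes.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(step, stall_seconds, log_file):
    """Call step(); if it runs longer than stall_seconds, dump the stack of
    every thread in this process to log_file, then let step() continue."""
    faulthandler.dump_traceback_later(stall_seconds, file=log_file)
    try:
        return step()
    finally:
        # Disarm the watchdog once the step has finished.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for a training loop that stalls on the next batch.
    time.sleep(0.5)
    return "done"


# faulthandler needs a real file descriptor, so a temporary file is used here;
# in a training script sys.stderr would usually be passed instead.
with tempfile.TemporaryFile("w+") as log:
    result = run_with_watchdog(slow_step, stall_seconds=0.1, log_file=log)
    log.seek(0)
    report = log.read()

print(report)
```

In a real training script the `step` argument could be the function that runs one epoch, with `stall_seconds` set well above the normal epoch time.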

Have you solved the problem? I've run into almost the same one.

I have the same problem. Any solution?
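
Not a confirmed fix, but a way to narrow the problem down along the lines the original post suggests (fewer `num_workers`). The sketch below assumes a PyTorch version recent enough that `DataLoader` accepts a `timeout` argument; `make_loader` is a hypothetical helper, and the random `TensorDataset` is only a stand-in for the real image dataset. With `num_workers=0` the data is loaded in the main process, so an exception inside the dataset appears as an ordinary traceback instead of a worker that "exited unexpectedly".

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size, num_workers, timeout_s=0):
    """Build a DataLoader whose worker count and timeout are easy to vary.

    With num_workers > 0, a positive timeout makes a stalled batch raise
    RuntimeError instead of blocking forever.
    """
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # The single-process loader expects timeout == 0.
        timeout=timeout_s if num_workers > 0 else 0,
    )


# Hypothetical stand-in for the real dataset: 1600 random samples, 10 classes.
dataset = TensorDataset(torch.randn(1600, 8), torch.randint(0, 10, (1600,)))

# Step 1: rule out the dataset itself by iterating without worker processes.
debug_loader = make_loader(dataset, batch_size=32, num_workers=0)
num_batches = sum(1 for _ in debug_loader)
print(num_batches)
```

If a full pass with `num_workers=0` succeeds, a second step could be `make_loader(dataset, 32, num_workers=4, timeout_s=60)`, so that a hang turns into an error pointing at the loader rather than a silent stall.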