The code I’m working on used to get stuck at random. I was able to come up with a minimal example that shows similar behavior (I ran a loop of 100 runs and it got stuck at some point; the example uses the Office-Home dataset, but I suspect the specific dataset doesn’t matter).
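For reference, here is a minimal sketch (not the exact script) of the kind of loop that can trigger the hang: several multi-worker loaders iterated in lock-step and re-created repeatedly. A synthetic `TensorDataset` stands in for Office-Home, and the name `run_once` is my own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the original example used the Office-Home
# images via an ImageFolder-style dataset, but any dataset should do.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

def run_once(num_workers: int = 8):
    # Three loaders consumed in lock-step, mirroring the training loop
    # described in this thread; each iter() spawns num_workers processes.
    loaders = [
        DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers)
        for _ in range(3)
    ]
    last_shape = None
    for b1, b2, b3 in zip(*loaders):
        imgs3 = b3[0]
        # the real script then did: imgs3 = imgs3.cuda(non_blocking=True)
        last_shape = tuple(imgs3.shape)
    return last_shape

# The hang only appeared after many repetitions, e.g.:
# for run in range(100):
#     run_once()
```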
Here’s the stack trace from when I pressed Ctrl+C:
[15:32 26-08-2020] Step  Loss : 12.5213
[15:32 26-08-2020] Step  Loss : 12.5203
[15:32 26-08-2020] Step  Loss : 12.5186
[15:32 26-08-2020] Step  Loss : 12.5178
[15:32 26-08-2020] Step  Loss : 12.5234
[15:32 26-08-2020] Step  Loss : 12.5203
^CTraceback (most recent call last):
File "tests/test_dl_wo_wandb.py", line 65, in <module>
imgs3 = imgs3.cuda(non_blocking=True)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 931, in _get_data
success, data = self._try_get_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/queue.py", line 179, in get
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
RuntimeError: DataLoader worker (pid 3508) is killed by signal: Terminated.
Is there something I’m doing in the example that is known to possibly deadlock?
Bumping the thread. Are there any suggestions for avoiding this problem? On the surface, there doesn’t seem to be anything wrong with doing this.
I deliberately use a lot of workers in the example to try to replicate the problem, but in the code I’m actually working on, I’ve hit it with fewer workers: ~8 per loader for the 3 loaders in the training loop, plus another 8 each for a test loader and a val loader (validation is run every 500 training steps).
I cannot reproduce the hang using
That being said, note that you are trying to use 60 processes in total to load the data, while also setting the number of threads to 40 via
torch.set_num_threads. Depending on your node, I would assume this could create multiple issues.
Thanks for the response. I have a couple of questions about this:
- Yes, the code I am using needs to spawn more processes than the number of CPU cores it is allotted. I thought this was okay (and I may not know much about this) as long as the kernel shares the load among these processes like any others. Is this inadvisable, i.e., should num_workers summed over all loaders be kept below the number of CPU cores?
- Is set_num_threads the best way to restrict the program to a certain number of CPU cores on a server?
- I think file reads do not need any locks, and multiple processes can read from the same file. Is there any reason you can think of why ImageFolder might behave differently from FakeData? (I am currently trying out FakeData on my end just to record the result.)
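To make the arithmetic behind the first question concrete, here is a rough sketch of the process/thread budget implied by the worker counts mentioned in this thread (the counts come from the posts above; `os.cpu_count()` reports the machine's cores, which is not necessarily the job's cgroup allotment):

```python
import os

# Worker counts from the thread: 3 training loaders with ~8 workers each,
# plus 8 each for a val loader and a test loader.
train_workers = 3 * 8
eval_workers = 2 * 8
total_workers = train_workers + eval_workers  # 40 worker processes

# On top of that, the minimal example called torch.set_num_threads(40),
# so the main process alone may run up to 40 intra-op threads.
cores = os.cpu_count() or 1
oversubscribed = total_workers > cores
print(total_workers, cores, oversubscribed)
```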
Just another update: I tried explicitly deleting the iterators with del before reinitializing them (to try to make sure the workers exit properly), and the program still gets stuck at some point.
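For clarity, the cleanup I tried looks roughly like the sketch below (shown with `num_workers=0` here for simplicity; with workers, the iterator is a `_MultiProcessingDataLoaderIter`, whose finalizer is what is supposed to shut the worker processes down):

```python
import gc
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=2, num_workers=0)

it = iter(loader)
(batch,) = next(it)

# Note: `del it`, not `del()`; del is a statement that drops the reference.
del it
gc.collect()  # encourage immediate finalization, which triggers worker shutdown
```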
Thanks a lot! I will take a look at the post you linked. Just FYI, I hit the hang with FakeData as well. I hadn’t been keeping track of when it happened (just sometime within a loop of 100 runs); this time it hung at run 5.
Hi, I am probably running into the same issue. Training suddenly hangs at seemingly random points, and when I send a KeyboardInterrupt, the program is stuck at the same place you mentioned.
I am wondering if there are any suggestions for solving this issue?
This problem is still there, even in 2023, when using multiple workers for data loading. Setting the number of workers to 0 avoids the hang, but that is not a real solution: training takes much longer. Does anyone have a solution?
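For anyone landing here, the workaround mentioned above looks like the first loader below. A commonly suggested mitigation (not a confirmed fix for this particular hang) is `persistent_workers=True`, which keeps worker processes alive across epochs instead of respawning them on every `iter()`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset for the sketch.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# Workaround from this thread: load in the main process. No hang, but slow.
safe_loader = DataLoader(dataset, batch_size=16, num_workers=0)

# Possible mitigation: reuse workers across epochs (available since PyTorch 1.7).
fast_loader = DataLoader(dataset, batch_size=16, num_workers=4,
                         persistent_workers=True)
```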