The code I’m working on used to get stuck at random. I was able to come up with a minimal example that shows similar behavior (I ran a loop of 100 runs and it got stuck at some point; the example uses the Office-Home dataset, but I suspect the specific dataset doesn’t matter).
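For reference, here is a minimal sketch (not the exact script) of the kind of loop that can trigger the hang: several multi-worker loaders iterated in lock-step and re-created repeatedly. A synthetic `TensorDataset` stands in for Office-Home, and the name `run_once` is my own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the original example used the Office-Home
# images via an ImageFolder-style dataset, but any dataset should do.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

def run_once(num_workers: int = 8):
    # Three loaders consumed in lock-step, mirroring the training loop
    # described in this thread; each iter() spawns num_workers processes.
    loaders = [
        DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers)
        for _ in range(3)
    ]
    last_shape = None
    for b1, b2, b3 in zip(*loaders):
        imgs3 = b3[0]
        # the real script then did: imgs3 = imgs3.cuda(non_blocking=True)
        last_shape = tuple(imgs3.shape)
    return last_shape

# The hang only appeared after many repetitions, e.g.:
# for run in range(100):
#     run_once()
```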
Here’s the stack trace from when I pressed Ctrl+C:
[15:32 26-08-2020] Step  Loss : 12.5213
[15:32 26-08-2020] Step  Loss : 12.5203
[15:32 26-08-2020] Step  Loss : 12.5186
[15:32 26-08-2020] Step  Loss : 12.5178
[15:32 26-08-2020] Step  Loss : 12.5234
[15:32 26-08-2020] Step  Loss : 12.5203
^CTraceback (most recent call last):
File "tests/test_dl_wo_wandb.py", line 65, in <module>
imgs3 = imgs3.cuda(non_blocking=True)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 931, in _get_data
success, data = self._try_get_data()
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/queue.py", line 179, in get
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/home/samarth/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
RuntimeError: DataLoader worker (pid 3508) is killed by signal: Terminated.
Is there something I’m doing in the example that is known to possibly deadlock?
Bumping the thread. Are there any suggestions for avoiding this problem? On the surface, there doesn’t seem to be anything wrong with doing this.
I deliberately use a lot of workers in the example to try to replicate the problem, but in the code I’m actually working on, I’ve hit it with fewer workers: ~8 per loader for the 3 loaders in the training loop, plus another 8 each for a test loader and a val loader (validation is run every 500 training steps).
I cannot reproduce the hang using
That being said, note that you are trying to use 60 processes in total to load the data, while also setting the number of threads to 40 via
torch.set_num_threads. Depending on your node, I would assume this could create multiple issues.
Thanks for the response. I have a couple of questions about this:
- Yes, the code I am using needs to spawn more processes than the number of CPU cores it is allotted. I thought this was okay (and I may not know much about this) as long as the kernel shares the load among these processes like any others. Is this inadvisable, i.e., should num_workers summed over all loaders be kept below the number of CPU cores?
- Is set_num_threads the best way to restrict the program to a certain number of CPU cores on a server?
- I think file reads do not need any locks, and multiple processes can read from the same file. Is there any reason you can think of why ImageFolder might behave differently from FakeData? (I am currently trying out FakeData on my end just to record the result.)
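To make the arithmetic behind the first question concrete, here is a rough sketch of the process/thread budget implied by the worker counts mentioned in this thread (the counts come from the posts above; `os.cpu_count()` reports the machine's cores, which is not necessarily the job's cgroup allotment):

```python
import os

# Worker counts from the thread: 3 training loaders with ~8 workers each,
# plus 8 each for a val loader and a test loader.
train_workers = 3 * 8
eval_workers = 2 * 8
total_workers = train_workers + eval_workers  # 40 worker processes

# On top of that, the minimal example called torch.set_num_threads(40),
# so the main process alone may run up to 40 intra-op threads.
cores = os.cpu_count() or 1
oversubscribed = total_workers > cores
print(total_workers, cores, oversubscribed)
```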
Just another update: I tried explicitly deleting the iterators with del before reinitializing them (to try to make sure the workers exit properly), and the program still gets stuck at some point.
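For clarity, the cleanup I tried looks roughly like the sketch below (shown with `num_workers=0` here for simplicity; with workers, the iterator is a `_MultiProcessingDataLoaderIter`, whose finalizer is what is supposed to shut the worker processes down):

```python
import gc
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=2, num_workers=0)

it = iter(loader)
(batch,) = next(it)

# Note: `del it`, not `del()`; del is a statement that drops the reference.
del it
gc.collect()  # encourage immediate finalization, which triggers worker shutdown
```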
Thanks a lot! I will take a look at the post you linked. Just FYI, I hit the hang with FakeData as well. I hadn’t been keeping track of when it happened (just sometime within a loop of 100 runs); this time it hung at run 5.
Hi, I am probably running into the same issue. Training suddenly hangs at seemingly random points, and when I send a KeyboardInterrupt, the program is stuck at the same place you mentioned.
I am wondering if there are any suggestions for solving this issue?
This problem is still there, even in 2023, when using multiple workers for data loading. Setting the number of workers to 0 avoids the hang, but that is not a real solution: training takes much longer. Does anyone have a solution?
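For anyone landing here, the workaround mentioned above looks like the first loader below. A commonly suggested mitigation (not a confirmed fix for this particular hang) is `persistent_workers=True`, which keeps worker processes alive across epochs instead of respawning them on every `iter()`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset for the sketch.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# Workaround from this thread: load in the main process. No hang, but slow.
safe_loader = DataLoader(dataset, batch_size=16, num_workers=0)

# Possible mitigation: reuse workers across epochs (available since PyTorch 1.7).
fast_loader = DataLoader(dataset, batch_size=16, num_workers=4,
                         persistent_workers=True)
```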