DataLoader worker (pid xxx) is killed by signal: Hangup

I’m using a DataLoader with 28 workers. After training for 3.5k iterations (~ 6 hours), I got the following error:

    idx, data = self._get_data()
  File "...site-packages/torch/utils/data/dataloader.py", line 848, in _get_data
    success, data = self._try_get_data()
  File "...site-packages/torch/utils/data/dataloader.py", line 811, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "...python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File ".../python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "...site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 487) is killed by signal: Hangup. 

Has anyone had a DataLoader worker killed by SIGHUP?

There are a number of issues about DataLoader workers being killed by signals such as Bus error or Killed. However, I can’t find any mention of DataLoader workers being killed by SIGHUP. My understanding is that SIGHUP is a signal sent to processes when their controlling terminal is closed, so it strikes me as an odd signal for a worker process to be killed by.

Has anyone run into this before, or have any insight into what kind of issue the SIGHUP signal would suggest?

Thanks!

SIGHUP is not necessarily raised only when a terminal is closed; just yesterday I ran into it while debugging a faulty CUDA kernel.

Were you able to train for a whole epoch, or are the mentioned 3.5k iterations still within the first epoch?
In the latter case, I would recommend using num_workers=0 and rerunning the code, which might yield a better error message in case the DataLoader isn’t able to load a sample.

You could also just iterate the DataLoader (with num_workers=0) on its own, without training the model, to make sure that all samples can be loaded properly.
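
A minimal sketch of this standalone check (train_dataset is just a placeholder for your own Dataset instance):

    from torch.utils.data import DataLoader

    # Iterate the dataset in the main process (num_workers=0) so any loading
    # error surfaces as a regular Python traceback instead of a killed worker.
    loader = DataLoader(train_dataset, batch_size=1, shuffle=False, num_workers=0)

    for i, batch in enumerate(loader):
        if i % 1000 == 0:
            print(f"checked batch {i}")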

Thanks for the reply – that’s good to know! I am able to train for a whole epoch even with num_workers>0; I just finished a job which completed without issue for 100 epochs. Still, I’ve had a few more DataLoader workers killed by SIGHUP since I made this post. My __getitem__() includes some stochastic preprocessing, which could explain why some epochs complete and others do not. I’m also querying S3 through boto3, which may be causing issues inside a parallelized DataLoader. From PyTorch’s multiprocessing best practices it sounds like some Python libraries use multiple threads and could lead to deadlock or other issues when used inside a DataLoader. Do you think a SIGHUP could be raised for that kind of issue?
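
(One commonly suggested pattern for using boto3 inside a multi-worker DataLoader — sketched here only as an illustration, not the poster’s actual code, and with a hypothetical class, bucket, and key names — is to create the boto3 client lazily inside the Dataset, so each worker builds its own client after forking instead of inheriting one from the main process:)

    import boto3
    from torch.utils.data import Dataset

    class S3Dataset(Dataset):
        """Hypothetical dataset that reads raw samples from S3.

        The boto3 client is created lazily on first use, so each DataLoader
        worker constructs its own client after forking rather than sharing
        one created in the main process.
        """

        def __init__(self, bucket, keys):
            self.bucket = bucket
            self.keys = keys
            self._client = None  # created per process on first access

        def _get_client(self):
            if self._client is None:
                self._client = boto3.client("s3")
            return self._client

        def __len__(self):
            return len(self.keys)

        def __getitem__(self, idx):
            obj = self._get_client().get_object(Bucket=self.bucket, Key=self.keys[idx])
            payload = obj["Body"].read()
            # ... decode `payload` and apply the stochastic preprocessing here ...
            return payload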

I don’t know what kind of error would raise the SIGHUP.
What kind of stochastic preprocessing are you using inside the Dataset?
Based on your description, it seems you are also using multi-threading inside __getitem__?