Debugging a segmentation fault in DataLoader multiprocessing

I keep getting the same segmentation fault in a DataLoader worker process on one machine (traceback attached below). It only happens when num_workers > 0. I have run the same code on 3 different machines, and it only happens on one of them. Sometimes, when num_workers == 0 but multiple GPUs are used (with torch.nn.DataParallel), the program also crashes, printing only “Segmentation fault (core dumped)” without any Python traceback.

Unfortunately, I’m too busy at the moment to put together a sample that reproduces the crash. I’m just wondering whether there is a way to catch the exception inside the DataLoader worker process, or to find out which part of the DataLoader is at fault. It would also be fine if there were a way to obtain the core dump file so that I can inspect it with gdb.

  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 745428) is killed by signal: Segmentation fault.
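
In the meantime, one thing I’m considering is enabling Python’s faulthandler inside each worker via the DataLoader’s worker_init_fn hook, so that a crashing worker at least dumps its Python-level traceback to stderr. A rough, untested sketch (the TensorDataset is just a stand-in for my real dataset):

import faulthandler

import torch
from torch.utils.data import DataLoader, TensorDataset

def enable_faulthandler(worker_id):
    # Register handlers in this worker so that a fatal signal
    # (SIGSEGV, SIGBUS, SIGABRT, ...) dumps the Python traceback
    # to stderr before the process dies.
    faulthandler.enable()

dataset = TensorDataset(torch.arange(100))  # stand-in for my real dataset
loader = DataLoader(dataset, batch_size=8, num_workers=4,
                    worker_init_fn=enable_faulthandler)

for batch in loader:
    pass

faulthandler only shows Python frames, so if the crash happens deep inside a C extension it will only point at the last Python call, but that might already narrow down which part of my Dataset is responsible.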

You could try to get the backtrace using gdb via:

gdb --args python script.py args
...
run
...
bt

This should give you the backtrace, which should point to the failing operation and help you debug the issue further.
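
Regarding the core dump you mentioned: core files are usually disabled by default (the soft limit for RLIMIT_CORE is 0). If you raise that limit in the main process before the DataLoader is created, the worker processes inherit it and should be allowed to write a core dump when they crash. A minimal sketch, assuming a Linux box where the hard limit is not also 0:

import resource

# Raise the soft core-file-size limit to whatever the hard limit allows,
# so that a crashing process (including forked DataLoader workers) may
# write a core dump.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

The shell-level equivalent is running ulimit -c unlimited before launching the script. Where the dump actually ends up depends on /proc/sys/kernel/core_pattern; on systemd-based distros it is often captured by systemd-coredump, in which case coredumpctl gdb should open the most recent dump in gdb, and you can run bt there as well.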