I keep getting the same segmentation fault in a DataLoader worker process on one machine (the traceback is attached at the end). The worker crash only happens when num_workers > 0. I ran the same code on 3 different machines, and it occurs on only one of them. Sometimes, even with num_workers == 0 but with multiple GPUs in use (via torch.nn.DataParallel), the program crashes and prints only “Segmentation fault (core dumped)” without a Python traceback.
Unfortunately, I don’t have time right now to put together a minimal sample that reproduces the crash. Is there a way to catch the exception inside the DataLoader worker process, or to find out which part of the data loading is faulty? It would also help to know how to obtain the core dump file so that I can inspect it with gdb.
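For reference, here is a minimal sketch of one way to get more information out of a crashing process, using only the standard library. The function name enable_crash_diagnostics is hypothetical. The idea is that faulthandler prints the Python stack when the process receives SIGSEGV, and raising the RLIMIT_CORE soft limit allows a core file to be written. Both are assumptions about what may help here, not a confirmed fix, and the resource module is only available on Unix.

```python
import faulthandler
import resource
import sys


def enable_crash_diagnostics(worker_id=None):
    """Hypothetical helper: call in the main process and in each worker.

    The optional worker_id argument matches the single argument that
    DataLoader passes to worker_init_fn.
    """
    # On fatal signals such as SIGSEGV, dump the Python stack of all
    # threads to stderr before the process dies.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    # Raise the soft core-file size limit to the hard limit so the kernel
    # may write a core dump. If the hard limit is 0, core dumps stay off.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    except (ValueError, OSError):
        # Some environments forbid changing the limit; keep going with
        # faulthandler alone.
        pass


# Enable diagnostics in the main process as well, since the
# DataParallel crash happens there.
enable_crash_diagnostics()
```

To cover the worker processes explicitly, the function can be passed to the loader as DataLoader(..., worker_init_fn=enable_crash_diagnostics). Where the core file ends up depends on the system setting in /proc/sys/kernel/core_pattern. Once a core file exists, it can be opened with gdb by giving it the Python executable and the core file, and bt then shows the native backtrace.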
  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 745428) is killed by signal: Segmentation fault.