I keep getting the same segmentation fault in a DataLoader worker process on one machine (the traceback is attached at the end). It only happens when num_workers > 0. I ran the same code on 3 different machines, and it only happens on one of them. Sometimes, even with num_workers == 0, if multiple GPUs are used (with torch.nn.DataParallel), the program crashes and simply prints “Segmentation fault (core dumped)” without a Python traceback.
Unfortunately, I’m too busy at the moment to put together a minimal sample that reproduces the crash. I’m just wondering whether there is a way to catch the exception in the DataLoader worker process, or to show which part of the data loading code is faulty. It would also be fine if there is a way to get the core dump file so that I can inspect it with gdb.
  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/home/library/miniconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/library/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 745428) is killed by signal: Segmentation fault.
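
For the question above about locating the faulty part of a worker, here is a minimal sketch of one possible approach, not a confirmed fix. It uses only the Python standard library: faulthandler dumps the Python stack of the crashing process when it receives SIGSEGV, and resource raises the soft core-dump limit so the kernel may write a core file. The function name enable_crash_diagnostics, the log_dir parameter, and the log file naming are hypothetical and chosen only for illustration. It assumes a Unix system.

```python
import faulthandler
import os
import resource

# Keep the log files referenced so they stay open for the worker's lifetime;
# faulthandler only remembers the underlying file descriptor.
_fault_logs = {}


def enable_crash_diagnostics(worker_id, log_dir="."):
    """Hypothetical helper, intended to be passed as a DataLoader worker_init_fn."""
    # Raise the soft core-dump size limit to the hard limit for this process.
    # If the hard limit is 0, core dumps stay disabled and must be allowed
    # outside Python (for example with `ulimit -c unlimited` in the shell).
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    # On SIGSEGV (and other fatal signals), write the Python traceback of
    # every thread in this process to a per-worker log file.
    path = os.path.join(log_dir, f"dataloader_worker_{worker_id}.log")
    log = open(path, "w")
    _fault_logs[worker_id] = log
    faulthandler.enable(file=log, all_threads=True)
    return path
```

The helper could be passed as DataLoader(..., worker_init_fn=enable_crash_diagnostics), since worker_init_fn is called with the worker id inside each worker process. After a crash, the per-worker log should show the Python frame that was executing, which narrows down which call in the dataset code triggers the fault. For the num_workers == 0 case with torch.nn.DataParallel, calling faulthandler.enable() at the top of the main script, or starting it with python -X faulthandler, serves the same purpose. Where a core file ends up depends on the system's /proc/sys/kernel/core_pattern setting; once found, it can be opened with gdb against the same python executable.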