Unexpected segmentation fault encountered in worker when loading dataset

I encounter the following error when using DataLoader workers to load data.
I am using PyG’s NeighborSampler as the “loader” at run_main.py line 152 to load a custom dataset, with num_workers set to os.cpu_count().

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
RuntimeError: DataLoader worker (pid 1096707) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_main.py", line 152, in train
    for step, _ in enumerate(loader):
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
    idx, data = self._get_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
    success, data = self._try_get_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1096707) exited unexpectedly

I am using Pytorch 1.12.0+cu116, one NVIDIA TITAN Xp GPU, CUDA version of 11.6, and Python version of 3.8.10.

I’ve searched a lot for this error and found the following suggested solutions. However, none of them helped.

  • Setting num_workers to 0 or 1. With num_workers of 0, I instead get a “corrupted double-linked list” error; with num_workers of 1, the same error (Unexpected segmentation fault encountered in worker) still occurs. In any case, I don’t want to reduce num_workers, because I am working on a fairly large dataset and fewer workers makes loading much slower.
  • Increasing the shared memory size. I did this by adding the line none /dev/shm tmpfs defaults,size=MY_SIZEG 0 0 to /etc/fstab and running mount -o remount /dev/shm. I set MY_SIZE to the full size of main memory (previously it was 50% of main memory).
  • Changing the Python version to <= 3.6.9. I tried this, but the same error still occurs.
  • Checking that Python and the dataset are mounted on the same disk. I’ve already verified that they are.
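
For reference, the first two checks above (capping the worker count and verifying /dev/shm capacity) can be sketched with the standard library alone. pick_num_workers is a hypothetical helper name, and the 1 GB free-space threshold is an arbitrary assumption, not a PyTorch requirement:

```python
import os
import shutil

def pick_num_workers(requested=None, min_shm_gb=1.0):
    """Cap DataLoader workers at the CPU count, and fall back to 0
    workers when /dev/shm looks too small to hold worker-to-main
    tensor transfers (threshold is an assumed heuristic)."""
    cpus = os.cpu_count() or 1
    workers = cpus if requested is None else max(0, min(requested, cpus))
    if workers > 0 and os.path.isdir("/dev/shm"):
        free_gb = shutil.disk_usage("/dev/shm").free / 2**30
        if free_gb < min_shm_gb:
            workers = 0  # too little shared memory: load in the main process
    return workers
```

A guard like this could then feed the loader, e.g. num_workers=pick_num_workers() instead of os.cpu_count() directly, so an undersized /dev/shm degrades to slower single-process loading rather than crashing workers.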

I’ve been struggling to fix this issue for several days, but I can’t find the right solution, and it really makes me frustrated. Could you please help me out?

Try to solve this issue (the error raised with num_workers of 0) first before debugging the one raised by multiple workers.

I will try to fix this issue ASAP and share the results. Thanks.

I’ve figured this out. I was using a customized, synthesized dataset expanded from an existing dataset. I re-synthesized the dataset, and the error disappeared with the new version. I’m not sure why, but it seems the previous dataset was corrupted in some way. Thanks for your advice.
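
For anyone hitting the same thing: one way a corrupted graph dataset may crash native sampling code is node ids in edge_index falling outside [0, num_nodes). A minimal, hypothetical sanity check in plain Python, assuming PyG’s two-row COO edge layout:

```python
def edge_index_in_bounds(edge_index, num_nodes):
    """Return True if every endpoint in a COO edge list
    [[sources], [destinations]] is a valid node id in [0, num_nodes)."""
    return all(0 <= v < num_nodes for row in edge_index for v in row)
```

On actual tensors, an equivalent check can be done before training, e.g. asserting that edge_index.min() >= 0 and edge_index.max() < num_nodes, so a bad dataset fails with a clear Python error instead of a worker segfault.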

Thanks for the follow up and good to hear you’ve solved the issue. I assume the error raised by multiple workers is also gone now?

Yes, those symptoms are all gone now.