Unexpected segmentation fault encountered in worker when loading dataset

I encounter the following error when using DataLoader workers to load data.
I am using NeighborSampler in PyG as the "loader" in run_main.py, line 152, to load a custom dataset, with num_workers set to os.cpu_count().
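
For context, the relevant part of run_main.py looks roughly like this (a minimal sketch, not my exact code; edge_index, train_idx, and the neighborhood sizes are placeholders):

```python
import os

import torch
from torch_geometric.loader import NeighborSampler  # PyG 2.x import path

# Placeholders for the custom dataset's tensors.
edge_index = torch.load("edge_index.pt")  # [2, num_edges] COO connectivity
train_idx = torch.load("train_idx.pt")    # seed nodes to sample batches from

loader = NeighborSampler(
    edge_index,
    node_idx=train_idx,
    sizes=[25, 10],              # neighbors sampled per hop (placeholder values)
    batch_size=1024,
    shuffle=True,
    num_workers=os.cpu_count(),  # the setting used when the crash below occurs
)

# run_main.py line 152 iterates the sampler like this:
for step, _ in enumerate(loader):
    ...
```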

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1096707) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_main.py", line 152, in train
    for step, _ in enumerate(loader):
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
    idx, data = self._get_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
    success, data = self._try_get_data()
  File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1096707) exited unexpectedly

I am using PyTorch 1.12.0+cu116, one NVIDIA TITAN Xp GPU, CUDA 11.6, and Python 3.8.10.

I’ve searched a lot for this error and found the following suggested solutions. However, none of them helped.

  • Using num_workers of 0 or 1. When I run with num_workers of 0, I get a “corrupted double-linked list” error instead. When I set num_workers to 1, the same error (Unexpected segmentation fault encountered in worker) still occurs. I’d rather not reduce num_workers anyway, because I am working on a fairly large dataset and fewer workers makes loading much slower. (A diagnostic sketch of how I reproduce this single-process follows this list.)
  • Increasing the shared memory size. I did this by adding a “none /dev/shm tmpfs defaults,size=MY_SIZEG 0 0” line to /etc/fstab and running mount -o remount /dev/shm. I set MY_SIZE to the full size of main memory (it was previously 50% of main memory).
  • Changing the Python version to <= 3.6.9. I tried this, but the same error still occurs.
  • Checking that Python and the dataset are mounted on the same disk. I’ve already verified that they are.
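
For what it’s worth, this is roughly how I try to narrow the crash down (a hedged sketch, not my actual code; make_loader stands in for however the NeighborSampler above gets built):

```python
import faulthandler

faulthandler.enable()  # dump a Python traceback if the main process gets SIGSEGV

def worker_init_fn(worker_id):
    # Also dump tracebacks from DataLoader worker processes on a crash.
    faulthandler.enable()

# 1) Reproduce single-process first: with num_workers=0 the fault happens in the
#    main process, so faulthandler can show which batch triggers it.
for step, _ in enumerate(make_loader(num_workers=0)):
    pass

# 2) If only multi-worker runs crash, keep the workers but add the init hook
#    (NeighborSampler forwards extra kwargs to torch's DataLoader):
loader = make_loader(num_workers=4, worker_init_fn=worker_init_fn)
```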

I’ve been struggling with this issue for several days, but I can’t find the right solution, and it is really frustrating. Could you please help me out?

Try to solve the error you see with num_workers=0 first, before trying to debug the one raised by multiple workers.

I will try to fix this issue ASAP and share the results. Thanks.

I’ve figured this out. I was using a customized, synthesized dataset expanded from an existing dataset. I re-synthesized the dataset, and with the new data the error disappeared. I’m not sure why, but it seems the previous dataset was corrupted in some way. Thanks for your advice.
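
In case it helps someone else hitting this, a sanity check along these lines would probably have caught the bad dataset earlier (a rough sketch; edge_index and x are placeholders for the synthesized graph’s tensors):

```python
import torch

def sanity_check_graph(edge_index: torch.Tensor, x: torch.Tensor) -> None:
    num_nodes = x.size(0)
    # Out-of-range or negative node indices in edge_index tend to crash the
    # C++ sampling ops with a segfault rather than a Python exception.
    assert edge_index.dtype == torch.long
    assert edge_index.numel() == 0 or edge_index.min() >= 0
    assert edge_index.numel() == 0 or edge_index.max() < num_nodes
    # Non-finite features are also easier to catch up front.
    assert torch.isfinite(x).all()

sanity_check_graph(edge_index, x)
```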

Thanks for the follow up and good to hear you’ve solved the issue. I assume the error raised by multiple workers is also gone now?

Yes, those symptoms are all gone now.

I am also getting the "ERROR: Unexpected segmentation fault encountered in worker." error,
but I cannot figure out why. When it gets to loading the validation set after the first epoch to evaluate performance, it fails with this error:
Epoch 0: 95%|████████████████ERROR: Unexpected segmentation fault encountered in worker.6<00:11, 4.30it/s, loss=nan]
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/se/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/home/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1202763) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1205, in _run_train
    self.fit_loop.run()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1202763) exited unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/home/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/casanovo/casanovo.py", line 256, in main
    model_runner.train(peak_path, peak_feature, peak_path_val, peak_feature_val, model, config)
  File "/home/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 320, in train
    trainer.fit(
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 59, in _call_and_handle_interrupt
    trainer.strategy.reconciliate_processes(traceback.format_exc())
  File "/home/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 460, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 1
Traceback (most recent call last):
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/home/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1202763) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1205, in _run_train
    self.fit_loop.run()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1202763) exited unexpectedly

Killed

I have the same error!

@r00bi Any updates on how you solved it?