RuntimeError: DataLoader worker is killed by signal: Segmentation fault

Sfitsos · May 20, 2021, 9:05am

I am going through my dataset using the data loader and I get the following error:

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 89199) is killed by signal: Segmentation fault.

The dataset I am using is massive (1.6M videos), and I can never complete even a single pass over the data before I see this crash. In the dataset, I read videos (using decord), read audio (using torchaudio, which shouldn't be the problem since I have used it before on similar data without issues) and load numpy arrays.
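
Since three different readers are involved (decord, torchaudio, numpy), one way to find out which one crashes is to run each load step in a throwaway child process and check whether that child was killed by a signal. The following is a minimal sketch, not a definitive diagnostic: the helper name `probe` and the example file path in the comments are hypothetical, and it assumes a POSIX system, where a negative return code means the child died from a signal.

```python
import signal
import subprocess
import sys


def probe(snippet, timeout=120):
    """Run `snippet` in a fresh Python process.

    Returns the name of the signal that killed the child (e.g. "SIGSEGV"),
    or None if the child exited on its own (cleanly or with an exception).
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    # On POSIX, returncode == -N means the child was terminated by signal N.
    if result.returncode < 0:
        return signal.Signals(-result.returncode).name
    return None


# Hypothetical usage, one reader at a time (the path is a placeholder):
#   probe("import decord; decord.VideoReader('/path/to/video.mp4')")
#   probe("import torchaudio; torchaudio.load('/path/to/audio.wav')")
#   probe("import numpy; numpy.load('/path/to/array.npy')")
```

Because the crash happens in the child, the parent survives and can keep looping over files, so a segfault tied to specific files or a specific library shows up as a list of paths for which `probe` returns "SIGSEGV".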

Specs of my machine:
128 GB RAM
CPU: AMD Ryzen Threadripper 1950X 16 Core CPU

The versions of the libraries I am using are:
torch 1.8.1
torchaudio 0.8.0a0+e4e171a
torchmetrics 0.3.2
torchvision 0.9.1

My shared memory settings are:

kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096

Other info:
I am using 10 workers in the dataloader

I have looked at similar issues and have no idea why this happens. It has happened even with the number of workers set to 0 (this was before some changes I tried, but I expect it to happen again). Also, when I run the exact same code on a server with 256 GB of RAM and far more cores, it works, so this is specific to this one machine. Does anyone have any idea how I can debug this further? I am stumped.

Update:
I noticed that if I increase the number of workers, the issue happens sooner. If I increase it to 32 (the number of hardware threads on this 16-core CPU), it sometimes happens instantly. My computer then gets into a state where the code keeps segfaulting instantly until I reduce the number of workers.

czh · August 4, 2022, 3:30pm

Did you solve the issue? I think I am experiencing the same thing or something very similar. Any help would be appreciated.