I am going through my dataset using the data loader and I get the following error:
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 257, in poll
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/kvougiou/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
RuntimeError: DataLoader worker (pid 89199) is killed by signal: Segmentation fault.
The dataset I am using is massive (1.6M videos), and I have never managed to complete even a single pass over it before hitting this crash. In the dataset, I read videos (using decord), read audio (using torchaudio, which shouldn't be the problem since I have used it before on similar data without issues), and load numpy arrays.
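To work out which of these three readers is crashing, one thing I could try is running each load in a throwaway child process and checking how it exits. Below is a minimal sketch of that idea, assuming Linux (it uses the "fork" start method). The names `run_isolated` and `describe_exit` are made up for illustration, and the `load_video` function in the usage comment is a hypothetical stand-in for my actual decord/torchaudio/numpy loading code.

```python
import multiprocessing as mp
import signal


def run_isolated(fn, arg, timeout=120.0):
    """Run fn(arg) in a child process and return the child's exit code.

    A negative exit code -N means the child was killed by signal N,
    so a segfault shows up as -signal.SIGSEGV instead of taking down
    the parent process.
    """
    ctx = mp.get_context("fork")  # assumption: running on Linux
    proc = ctx.Process(target=fn, args=(arg,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # The load hung: kill it so the scan can continue.
        proc.kill()
        proc.join()
    return proc.exitcode


def describe_exit(exitcode):
    """Turn an exit code into a short human-readable description."""
    if exitcode == 0:
        return "ok"
    if exitcode is not None and exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with code {exitcode}"


# Usage sketch (load_video and video_paths are hypothetical):
# for path in video_paths:
#     code = run_isolated(load_video, path)
#     if code != 0:
#         print(path, describe_exit(code))
```

This is slow because it starts one process per sample, so it is probably only practical on a subset of the data, but it should show whether a specific file or a specific reader triggers the segfault.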
Specs of my machine:
RAM: 128 GB
CPU: AMD Ryzen Threadripper 1950X (16 cores)
The versions of the libraries I am using are:
My shared memory settings are:
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
I am using 10 workers in the DataLoader.
I have looked at similar issues and still have no idea why this happens. It has happened even with num_workers=0 (that was before some changes I tried, but I expect it would happen again). Also, when I run the exact same code on a server with 256 GB of RAM and many more cores, it works, so the problem is specific to this one machine. Does anyone have any idea how I can debug this further? I am stumped.
I noticed that if I increase the number of workers, the crash happens sooner. If I increase it to 32 (the number of hardware threads on this CPU), it sometimes happens instantly. After that, the machine gets into a state where the code keeps segfaulting instantly until I reduce the number of workers.