DataLoader worker is killed by signal - only on parallel runs

I know this error is very common and there are many threads about it; unfortunately, none of them covers my specific scenario, and I wasn't able to solve the problem using the previous posts.


CUDA 10.1
Python 3.7
PyTorch 1.6

sysctl kernel.shmmax
kernel.shmmax = 18446744073692774399

2 GPUs, 20 cores.
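Besides `kernel.shmmax`, the actual space available in `/dev/shm` matters, since that is what PyTorch DataLoader workers use to pass tensors between processes on Linux. A quick stdlib-only check (the path assumes a standard Linux setup):

```python
import shutil

# /dev/shm backs inter-process tensor sharing for DataLoader workers on
# Linux; a worker "killed by signal: Killed" can mean it (or system RAM)
# ran out while two training runs competed for it.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```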

The error ( DataLoader worker (pid 27351) is killed by signal: Killed ) occurs only when I try to dispatch two parallel runs, i.e.

Works fine (single run):
python train -gpus 0,1 num_data_workers 20

Crashes (both runs crash at the same time, with different PIDs):
first shell:
python train -gpus 0 num_data_workers 5
second shell:
python train -gpus 1 num_data_workers 5

Any idea what can cause that? My dataset does not use any shared storage (I load a NumPy array representing the whole dataset at the start of training).
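For context, a minimal sketch of such a dataset (the class name and shapes are hypothetical; in the real setup the array would come from something like `np.load` at startup):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Holds the entire dataset as a NumPy array in RAM; no shared storage
    is touched during training."""

    def __init__(self):
        # Stand-in for loading the full array once at start of training.
        rng = np.random.default_rng(0)
        self.data = rng.random((100, 3)).astype(np.float32)
        self.labels = rng.integers(0, 2, size=100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Each worker process indexes into its own view of the array.
        return torch.from_numpy(self.data[idx]), int(self.labels[idx])
```

Note that with `num_workers > 0` each forked worker shares these pages copy-on-write, but Python refcounting can still touch them, so two parallel runs multiply the memory pressure.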

Are you seeing the same error if num_workers=0 is used for the parallel runs?

Luckily no, that's what I'm currently doing (one script uses multiple workers and the other is set to num_workers=0).

If you are using Windows, are you adding the if-clause protection as described in the Windows FAQ?

Nope, Ubuntu 18.04.

Usually you would see the DataLoader crash with this error if one of the workers encountered an error in loading the data.
Setting num_workers=0 and rerunning the code should yield the real error message.
Since that's working fine, my best guess is that "something" related to multiprocessing isn't working properly. Maybe you could try using a Docker container and see if it works there?
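The suggestion above can be sketched as follows: with `num_workers=0` the data loading runs in the main process, so an exception raised inside `__getitem__` surfaces as a normal traceback instead of the worker being killed by a signal (the toy dataset here just stands in for the real one):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.arange(10.0))

# num_workers=0 disables worker processes: any data-loading error is
# raised directly in this process with its real message and traceback.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in loader:
    pass  # a real training step would go here
```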

Is there a chance that the DataLoader will crash outside of __getitem__?
I'm using a headless machine, so I create a stub display using orca. I now realize that during parallel runs with num_workers=0, the system sometimes gets into a deadlock and hangs forever.
Could that cause a DataLoader crash in a multi-worker scenario?

I don’t know but wouldn’t exclude this possibility.
Is your node working properly without orca?

Yes, thank you for helping drill down on the problem.
Unfortunately it's orca: it holds on to shared memory greedily, so the system runs out of shared memory when two processes want to use the library.