DataLoader worker is killed by signal - only on parallel runs

I know this error is very common and there are a lot of threads about it; unfortunately, none of them discusses my specific scenario, and I wasn't able to solve my problem using the previous posts.

Configuration:

CUDA 10.1
Python 3.7
PyTorch 1.6

sysctl kernel.shmmax
kernel.shmmax = 18446744073692774399

2 GPUs, 20 cores.

The error (DataLoader worker (pid 27351) is killed by signal: Killed) occurs only when I try to dispatch two parallel runs, i.e.

Works:
python train -gpus 0,1 num_data_workers 20

Crashes (both runs crash at the same time, with different pids):
first shell:
python train -gpus 0 num_data_workers 5
second shell:
python train -gpus 1 num_data_workers 5

Any idea what can cause that? My dataset does not use any shared storage (I load a NumPy array representing the whole dataset at the start of training).
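
For context, the dataset is roughly of this shape (the names here are illustrative, not my actual training code):

import numpy as np
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Holds the whole dataset as one NumPy array loaded up front."""

    def __init__(self, npy_path):
        # no shared storage: each worker just indexes into this array
        self.data = np.load(npy_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # convert a single sample to a tensor on demand
        return torch.from_numpy(self.data[idx])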

Are you seeing the same error if num_workers=0 is used for the parallel runs?

Luckily no, this is what I'm currently doing (one script uses multiple workers and the other is set to num_workers=0).

If you are using Windows, are you adding the if-clause protection as described in the Windows FAQ?
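
For reference, the guard from the Windows FAQ looks roughly like this (the dataset and loader arguments are just placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(100, 10))
    loader = DataLoader(dataset, batch_size=8, num_workers=5)
    for (batch,) in loader:
        pass  # training step would go here

if __name__ == '__main__':
    # on Windows, worker processes are spawned and re-import the module,
    # so the entry point has to be guarded
    main()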

Nope, Ubuntu 18.04.
Any ideas?

Usually you would see the DataLoader crash with this error if one of the workers encountered an error while loading the data.
Setting num_workers=0 and rerunning the code should yield the real error message.
Since that's working fine, my best guess is that "something" related to multiprocessing isn't working properly. Maybe you could try a Docker container and see if it works there?
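
Something like this makes it easy to toggle between the two modes while debugging (the dataset and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                        torch.randint(0, 10, (256,)))

# num_workers=0 loads the data in the main process, so any exception
# raised inside __getitem__ surfaces directly instead of killing a worker
debug = True
loader = DataLoader(dataset, batch_size=32,
                    num_workers=0 if debug else 5)

for images, labels in loader:
    pass  # training step would go here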

Is there a chance that the DataLoader will crash outside of __getitem__?
I'm using a headless machine, so I create a stub display using orca. I now realize that during parallel runs with num_workers=0 the system sometimes gets into a deadlock and hangs forever.
Could that result in the DataLoader crashing in a multi-worker scenario?

I don’t know but wouldn’t exclude this possibility.
Is your node working properly without orca?

Yes, thank you for helping drill down on the problem.
Unfortunately it's orca, reserving shared memory in a greedy manner, which results in running out of shared memory when two processes want to use the library.
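
For anyone hitting the same issue, a quick way to check how much shared memory is left before starting the runs (a rough sketch, not tied to my training code):

import shutil

# DataLoader workers pass batches through shared memory (/dev/shm);
# if another process has grabbed most of it, worker allocations fail
# and the workers get killed
usage = shutil.disk_usage('/dev/shm')
print(f"/dev/shm: total={usage.total / 2**30:.1f} GiB, "
      f"free={usage.free / 2**30:.1f} GiB")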

Thanks