I know this error is very common and there are a lot of threads about it; unfortunately, none of them covers my scenario specifically, and I wasn't able to solve my problem using the previous posts.
The error (DataLoader worker (pid 27351) is killed by signal: Killed) occurs only when I try to dispatch two parallel runs, i.e.:
Works: python train -gpus 0,1 num_data_workers 20
Crashes (both runs crash at the same time, with different pids):
first shell: python train -gpus 0 num_data_workers 5
second shell: python train -gpus 1 num_data_workers 5
Any idea what can cause that? My dataset does not use any shared storage (I load a NumPy array representing the whole dataset at the start of training).
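For context, the dataset described above can be sketched roughly like this (the class and argument names are illustrative, not from my actual code): a map-style dataset only needs __len__ and __getitem__, with the whole NumPy array held in memory rather than read from shared storage.

```python
import numpy as np


class ArrayDataset:
    """Map-style dataset over a NumPy array held fully in memory.

    Any object exposing __len__ and __getitem__ can be passed to a
    torch.utils.data.DataLoader; subclassing torch.utils.data.Dataset
    is conventional but not required.
    """

    def __init__(self, array):
        # The whole dataset lives in this one in-memory array; each
        # DataLoader worker process inherits its own view when forked.
        self.data = np.asarray(array)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```

In a real training script the array would typically be loaded once with np.load at startup and handed to this wrapper.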
Usually you would see the DataLoader crash with this error if one of the workers encountered an error while loading the data.
Setting num_workers=0 and rerunning the code should yield the real error message.
Since that's working fine, my best guess is that "something" related to multiprocessing isn't working properly. Maybe you could try a Docker container and see if it works there?
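The debugging step above can be sketched as follows (the dataset here is a stand-in; the point is only the num_workers=0 setting). With num_workers=0 all loading happens in the main process, so any exception raised inside the dataset surfaces with its full traceback instead of the opaque "killed by signal" message.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 samples of 3 features each.
dataset = TensorDataset(torch.randn(100, 3))

# num_workers=0 disables worker processes, so a failing __getitem__
# raises directly in the main process with a readable traceback.
loader = DataLoader(dataset, batch_size=10, num_workers=0)

for (batch,) in loader:
    pass  # any data-loading error would be raised right here
```

Once the underlying error is fixed, num_workers can be raised back up for throughput.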
Is there a chance that the DataLoader will crash outside of __getitem__?
I'm using a headless machine, so I create a stub display using orca. I now realize that during parallel runs with num_workers=0 the system sometimes deadlocks and hangs forever.
Could that result in a DataLoader crash in a multi-worker scenario?
Yes, thank you for helping drill down on the problem.
Unfortunately, it's orca greedily holding on to shared memory, so the system runs out of shared memory when two processes try to use the library at the same time.
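One way to confirm a shared-memory exhaustion diagnosis like this is to watch the shared-memory mount while both runs start. A minimal sketch, assuming a Linux machine where shared memory is the usual tmpfs at /dev/shm (the path is a parameter, since the mount point can differ):

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (used, total) bytes for the filesystem backing `path`.

    Pointing this at the shared-memory tmpfs (commonly /dev/shm on
    Linux) shows how close the two processes are to exhausting it.
    """
    usage = shutil.disk_usage(path)
    return usage.used, usage.total
```

If shared memory really is the bottleneck, PyTorch also offers torch.multiprocessing.set_sharing_strategy("file_system") as a possible workaround, which trades /dev/shm usage for ordinary file-backed storage.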