I know this error is very common and there are many threads about it; unfortunately, none of them covers my specific scenario, and I wasn't able to solve my problem using the previous posts.
CUDA 10.1, Python 3.7, PyTorch 1.6
sysctl kernel.shmmax: kernel.shmmax = 18446744073692774399
2 GPUs, 20 cores.
The error (DataLoader worker (pid 27351) is killed by signal: Killed) occurs only when I dispatch two parallel runs.

This works (a single run using both GPUs):

python train -gpus 0,1 num_data_workers 20

This crashes (both runs crash at the same time, with different pids):
python train -gpus 0 num_data_workers 5
python train -gpus 1 num_data_workers 5
Any idea what could be causing this? My dataset does not use any shared storage: I load a NumPy array representing the whole dataset at the start of training.
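For context, the data pipeline is roughly like the sketch below (simplified; the class, file, and parameter names are placeholders, not my actual code). The point is that the whole array sits in memory before the workers are started, and __getitem__ only indexes into it:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    # The whole dataset is loaded into RAM once, before training starts;
    # nothing is read from disk or shared storage inside __getitem__.
    def __init__(self, path):
        self.data = np.load(path)  # full NumPy array in memory

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx])

# Each of the two parallel runs builds its own loader, e.g. with 5 workers:
loader = DataLoader(ArrayDataset("dataset.npy"), batch_size=64,
                    shuffle=True, num_workers=5)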