I'm trying to train with DDP on 4 GPUs, but I'm getting: process 3 terminated with signal SIGTERM
It happens most of the way through validation, for some reason. Does anyone have any idea why this might happen, or how I can debug it more easily?
Traceback (most recent call last):
  File "train_gpu.py", line 210, in <module>
    main_local(hparam_trial)
  File "train_gpu.py", line 103, in main_local
    trainer.fit(model)
  File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 3 terminated with signal SIGTERM
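One thing I'm considering, to at least see what each rank was doing when it got killed, is registering a SIGTERM handler at the top of every spawned worker (e.g. first thing in the function passed to mp.spawn). This is just a sketch using only the standard library; install_sigterm_traceback is my own name, not a Lightning or torch API:

```python
import os
import signal
import sys
import traceback


def install_sigterm_traceback():
    """Dump this process's Python stack to stderr when SIGTERM arrives, then exit.

    The idea (my assumption): when one DDP worker dies, the spawn context tears
    the others down with SIGTERM, so logging the stack on SIGTERM shows where
    each surviving rank was at teardown time.
    """
    def _handler(signum, frame):
        print(f"[pid {os.getpid()}] got SIGTERM, stack at time of kill:",
              file=sys.stderr)
        traceback.print_stack(frame, file=sys.stderr)
        sys.stderr.flush()
        os._exit(1)  # mimic the default fatal behaviour after logging

    signal.signal(signal.SIGTERM, _handler)
```

With that in place, the rank that actually hit the underlying error should be distinguishable from the ranks that were merely cleaned up afterwards.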