Debugging: process 3 terminated with signal SIGTERM

Trying to train using DDP on 4 GPUs, but I'm getting: process 3 terminated with signal SIGTERM
For some reason this happens most of the way through validation. Does anyone have any idea why this might happen, or how I can debug it more easily?

File "train_gpu.py", line 210, in <module>
main_local(hparam_trial)
File "train_gpu.py", line 103, in main_local
trainer.fit(model)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 3 terminated with signal SIGTERM


Does the validation loop run correctly on a single device?
The error messages are usually more informative when multi-GPU runs and multiprocessing are disabled, since an exception in a spawned worker often just shows up as SIGTERM on the other ranks.
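A minimal sketch of that debugging step, assuming the older pytorch_lightning 0.x Trainer API visible in the traceback (`gpus` and `distributed_backend` keyword arguments; `model` and the script itself are from the original post):

```python
# Hypothetical debug configs: switch the 4-GPU DDP run down to a single
# device so the real validation-loop exception surfaces in the main
# process instead of being masked by a SIGTERM on another rank.
ddp_config = dict(gpus=4, distributed_backend="ddp")    # original run
debug_config = dict(gpus=1, distributed_backend=None)   # single-device debug run

# With pytorch_lightning installed, the debug run would look like:
# trainer = pytorch_lightning.Trainer(**debug_config)
# trainer.fit(model)
```

If the validation loop raises on a single device, the traceback there is usually the actual root cause of the DDP SIGTERM.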

Hi Bruce_muller:
Did you ever solve this?
I have the same issue: 4-GPU DDP training terminated with signal SIGTERM.