DDP Error: torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers

This error started occurring after I upgraded PyTorch to 1.10.0 last week. Here’s the log:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/anaconda3/envs/aa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/envs/aa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 15265 got signal: 1

Hey @Kiuk_Chung, is there any BC-breaking change in 1.10?

@mrshenli No BC issue that I know of yet. Looks like in this case the worker (not the agent) died due to a SIGHUP. Could you share the command you are running, how you are running it (from a terminal, on a job scheduler, etc) and the script you are running?

@Kiuk_Chung I run a shell script from a terminal, and the command looks like this:

CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 --master_port=26500  training.py --arg1 --arg2

This error always appears after 10 epochs.

If the job terminates with a SIGHUP mid-execution, then something other than torch.distributed.launch is causing the job to fail (torch.distributed.launch issues typically happen on startup, not mid-execution).

Since your trainers died with a signal (SIGHUP), which is typically sent when the terminal is closed, you’ll have to dig through the log (console) output to see what the actual exception was (or where the program was before it got killed). I recommend registering Python’s faulthandler to get the trace information; see “faulthandler: Dump the Python traceback” in the Python 3.10.0 documentation.
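For anyone who wants to try the faulthandler suggestion, here is a minimal sketch. The helper name dump_traceback_on is made up for illustration, and it assumes a Unix system (faulthandler.register and SIGHUP are not available on Windows). It is meant to be called in the worker script, not in the launcher.

```python
import faulthandler
import signal
import sys


def dump_traceback_on(signum=signal.SIGHUP, stream=sys.stderr):
    """Print every thread's Python stack to `stream` when `signum` arrives.

    Hypothetical helper, not part of PyTorch.
    """
    # Fatal errors (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL) also produce
    # a traceback once faulthandler is enabled.
    faulthandler.enable(file=stream, all_threads=True)
    # chain=True calls the previously installed handler after the dump, so
    # the process still reacts to the signal the way it did before.
    faulthandler.register(signum, file=stream, all_threads=True, chain=True)
```

Calling dump_traceback_on() near the top of training.py should make each worker print where it was when the SIGHUP arrived, which is the information the console log above is missing.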

Did you find a solution? I am running into the same problem.

How do you submit the job? I hit the same problem when using the nohup command (possibly triggered by the terminal shutting down?). I am now trying the screen command instead.

I am using 1.10 and I also face this problem…

I just gave up using ddp :expressionless:

I also use the nohup command, and I’m sure it’s not caused by the terminal shutting down.

@wachhu Do you have a simple repro for this issue? It would help us narrow down the issue further.

@Kiuk_Chung Was wondering if there is a parent/controlling process involved here in our framework which might exit unexpectedly causing this issue?

It’s hard to tell what the root cause was from the provided excerpt of the logs. I need the full logs. But from these lines:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP

what is probably happening is that the launcher process (the one that is running torch.distributed.launch) got a SIGHUP. To make sure the worker PIDs are not orphaned, torchelastic forwards any signals the launcher process receives down to the worker processes (code here: pytorch/api.py at master · pytorch/pytorch · GitHub).

In this case PIDs 15342 and 15343 are the worker PIDs that get sent a SIGHUP because the launcher PID (the parent) got a SIGHUP.
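To make the forwarding concrete, here is a rough sketch of the pattern using only the standard library. It is not the torchelastic code: run_and_forward and this SignalException class are illustrative stand-ins, the set of handled signals is an assumption, and it only runs on Unix.

```python
import signal
import subprocess
import sys


class SignalException(Exception):
    """Raised in the launcher when it receives a termination signal."""

    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval


def _raise_on_signal(signum, frame):
    # Turn the signal into an exception so the wait loop below can react.
    raise SignalException(f"got signal: {signum}", sigval=signal.Signals(signum))


def run_and_forward(cmds):
    """Start one worker per command; pass termination signals on to them."""
    # Must run in the main thread: signal.signal() is only allowed there.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP, signal.SIGQUIT):
        signal.signal(sig, _raise_on_signal)
    workers = [subprocess.Popen(cmd) for cmd in cmds]
    try:
        for worker in workers:
            worker.wait()
    except SignalException as exc:
        # Forward the same signal to every worker that is still alive,
        # so no worker is left orphaned, then re-raise in the launcher.
        for worker in workers:
            if worker.poll() is None:
                worker.send_signal(exc.sigval)
        for worker in workers:
            worker.wait()
        raise
    return [worker.returncode for worker in workers]
```

With this shape, a SIGHUP delivered to the parent ends up killing every child as well, which matches the two "Sending process ... closing signal SIGHUP" lines in the log.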

If the launcher is wrapped with nohup:

$ nohup python -m torch.distributed.launch ...

Then theoretically the SIGHUP should’ve been ignored and not passed through to the launcher PID, but from the logs it’s clear that’s not what happened. FWIW, here are the signals that the elastic agent passes through to the worker PIDs: SIGTERM, SIGINT, SIGHUP, SIGQUIT (for Unix) and SIGTERM, SIGINT (for Windows). SIGKILL is not in the list because a process cannot catch it. See: pytorch/api.py at master · pytorch/pytorch · GitHub

So I was able to repro this with nohup. Currently none of torch.distributed.launch, torchrun, or torch.distributed.run work with nohup, because we register our own termination handler for SIGHUP, which overrides the ignore handler set by nohup. If you need to close the terminal but keep the process running, you can use tmux or screen.
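The override is easy to see with the standard signal module. nohup starts the program with SIGHUP set to "ignore", and any later signal.signal() call for SIGHUP replaces that setting. The sketch below shows one way a program could keep nohup's choice; install_unless_ignored is a hypothetical helper and is not what the launcher does today.

```python
import signal


def install_unless_ignored(signum, handler):
    """Install `handler` for `signum` unless the signal is already ignored.

    Returns True if the handler was installed, False if the existing
    "ignore" disposition (for example the one set by nohup) was kept.
    """
    if signal.getsignal(signum) == signal.SIG_IGN:
        return False
    signal.signal(signum, handler)
    return True
```

Under nohup, install_unless_ignored(signal.SIGHUP, handler) would return False and leave SIGHUP ignored, whereas a plain signal.signal(signal.SIGHUP, handler) call makes the process react to SIGHUP again.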

If you start the nohup DDP program (torch.distributed.launch, torchrun, or torch.distributed.run) from a server terminal, log out with the exit command instead of closing the terminal window directly. nohup seems to have a small bug with multi-process programs.

I am also having the same issue. Are there any solutions for this?