Hi,
Since last week I upgraded pytorch to 1.10.0, this error occurred. Here’s the log:
Blockquote
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP
Traceback (most recent call last):
File “/home/anaconda3/envs/aa/lib/python3.8/runpy.py”, line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File “/home/anaconda3/envs/aa/lib/python3.8/runpy.py”, line 87, in _run_code
exec(code, run_globals)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py”, line 193, in
main()
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py”, line 189, in main
launch(args)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py”, line 174, in launch
run(args)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/run.py”, line 710, in run
elastic_launch(
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 252, in launch_agent
result = agent.run()
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py”, line 125, in wrapper
result = f(*args, **kwargs)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py”, line 709, in run
result = self._invoke_run(role)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py”, line 843, in _invoke_run
time.sleep(monitor_interval)
File “/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py”, line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 15265 got signal: 1
@mrshenli No BC issue that I know of yet. Looks like in this case the worker (not the agent) died due to a SIGHUP. Could you share the command you are running, how you are running it (from a terminal, on a job scheduler, etc) and the script you are running?
If the job terminates with a SIGHUP mid-execution then there’s something else other than torch.distributed.launch that is causing the job to fail (typically torch.distributed.lauch issues happen on startup not mid-execution).
Since your trainers died with a signal (SIGHUP) which is typically sent when the terminal is closed, you’ll have to dig through the log (console) output to see what the actual exception was (or where the program was before it got killed). I recommend you register this faulthandler to get the trace information : faulthandler — Dump the Python traceback — Python 3.10.0 documentation
Its hard to tell what the root cause was from the provided excerpt of the logs. I need the full logs. But from this line:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP
In this case PIDs 15342 and 15343 are the worker PIDs that get sent a SIGHUP because the launcher PID (the parent) got a SIGHUP.
If the launcher is wrapped with nohup:
$ nohup python -m torch.distributed.launch ...
Then theoretically the SIGHUP shoud’ve been ignored and not passed-through to the launcher PID, but from the logs its clear that’s not what happened. FWIW here are the signals that are passed-through to the worker PIDs from the elastic agent: SIGTERM, SIGKILL, SIGHUP, SIGQUIT (for unix) and SIGTERM, SIGKILL (for windows). See: https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/multiprocessing/api.py#L233-L237
So I was able to repro this with nohup. Currently torch.distributed.launch, torchrun, torch.distributed.run will all not work with nohup since we register our own termination handler for SIGHUP, which overrides the ignore handler by nohup. If you need to close the terminal but keep the process running, then you can use tmux or screen.
If you are using the server terminal to run the nohup ddp-program, torch.distributed.launch, torchrun, torch.distributed.run, you need to use the exit command to exit instead of closing the terminal directly. Nohup has a little bug when multi-process.
As a regular user of nohup I was surprised to find this thread after receiving this error. I’ve never heard of another process, e.g. torch, taking over nohup’s no-hang-up functionality.
tmux and screen were suggested above, but without any guidance for us re. how to use them in a similar way to nohup. e.g., tmux creates a whole new “session” one has to manage somehow; this is not like nohup.
So, to whom it may concern, the following is the same bash function presented in original nohup form and new-improved tmux form – though I confess it was ChatGPT who provided the latter:
# for launching jobs
# usage: launchrun <gpus> <logfile> <command and all args>
function launchrun() {
CUDA_VISIBLE_DEVICES="$1" nohup "${@:3}" > "$2" 2>&1 &
}
# usage: launchrun <gpus> <logfile> <command and all args>
function launchrun() {
gpus="$1"
logfile="$2"
shift 2 # Remove the first two arguments (gpus and logfile)
# Generate a semi-unique session ID using date
session_id="session_$(date +"%Y%m%d%H%M%S")"
# Create a tmux session, run the command inside it, and detach
tmux new-session -d -s "$session_id" "CUDA_VISIBLE_DEVICES=$gpus $* > $logfile 2>&1"
# Optionally, you can also rename the window (replace "mywindow" with your desired name)
# tmux rename-window -t "$session_id:0" mywindow
# Detach from the session
tmux detach-client
echo "Session ID: $session_id"
}
I’d be interested in receiving comments or improvements to the above. Thanks!