DDP Error: torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers

wachhu · November 2, 2021, 11:24am

Hi,
Since I upgraded PyTorch to 1.10.0 last week, this error has been occurring. Here’s the log:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/anaconda3/envs/aa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/envs/aa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/anaconda3/envs/aa/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 15265 got signal: 1

mrshenli · November 2, 2021, 2:47pm

Hey @Kiuk_Chung, is there any BC-breaking change in 1.10?

Kiuk_Chung · November 2, 2021, 3:53pm

@mrshenli No BC issue that I know of yet. Looks like in this case the worker (not the agent) died due to a SIGHUP. Could you share the command you are running, how you are running it (from a terminal, on a job scheduler, etc.), and the script you are running?

wachhu · November 3, 2021, 2:42am

@Kiuk_Chung I run a shell script from a terminal, and the command looks like this:

CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 --master_port=26500  training.py --arg1 --arg2

This error always appeared after 10 epochs.

Kiuk_Chung · November 3, 2021, 3:45am

If the job terminates with a SIGHUP mid-execution, then something other than torch.distributed.launch is causing the job to fail (torch.distributed.launch issues typically happen on startup, not mid-execution).

Since your trainers died with a signal (SIGHUP), which is typically sent when the terminal is closed, you’ll have to dig through the log (console) output to see what the actual exception was (or where the program was before it got killed). I recommend registering faulthandler to get the trace information; see “faulthandler: Dump the Python traceback” in the Python 3.10.0 documentation.
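
One low-effort way to turn faulthandler on without editing training.py is the standard PYTHONFAULTHANDLER environment variable, which enables it at interpreter startup. The sketch below is only an illustration: the helper name with_faulthandler is hypothetical, and it assumes the workers inherit the launcher's environment. Note that faulthandler's default handlers cover fatal signals such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL, so it may show where a worker crashed but will not by itself report a SIGHUP.

```shell
# Hypothetical helper: run any command with Python's faulthandler enabled.
# PYTHONFAULTHANDLER is a standard CPython environment variable; it is placed
# in the command's environment, so child Python processes inherit it too.
with_faulthandler() {
    env PYTHONFAULTHANDLER=1 "$@"
}

# Usage (same command as in the original post):
#   export CUDA_VISIBLE_DEVICES=6,7
#   with_faulthandler python -m torch.distributed.launch --nproc_per_node 2 --master_port=26500 training.py --arg1 --arg2
```

Any traceback that faulthandler dumps goes to stderr, so it ends up wherever the job's console output is being captured.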

dabs · January 27, 2022, 1:40am

Did you find a solution? I am running into the same problem.

zhihe_lu · February 22, 2022, 7:48pm

How do you submit the job? I ran into the same problem when using the nohup command (perhaps triggered by the terminal shutting down?). I am now trying the screen command instead.

Ziyu_Huang · March 1, 2022, 11:18am

I am using 1.10 and I also face this problem…

wachhu · March 1, 2022, 11:56am

I just gave up using ddp

wachhu · March 1, 2022, 12:00pm

I also use the nohup command, and I’m sure it’s not about the terminal shutting down.

pritamdamania87 · March 4, 2022, 4:02am

@wachhu Do you have a simple repro for this issue? It would help us narrow down the issue further.

@Kiuk_Chung Was wondering if there is a parent/controlling process involved here in our framework which might exit unexpectedly causing this issue?

Kiuk_Chung · March 11, 2022, 12:33am

It’s hard to tell what the root cause was from the provided excerpt of the logs. I need the full logs. But from these lines:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15342 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15343 closing signal SIGHUP

what is probably happening is that the launcher process (the one that is running torch.distributed.launch) got a SIGHUP. To make sure the worker PIDs are not orphaned, torchelastic forwards any signals the launcher process receives down to the worker processes (code here: https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/agent/server/api.py#L716).

In this case PIDs 15342 and 15343 are the worker PIDs that get sent a SIGHUP because the launcher PID (the parent) got a SIGHUP.

If the launcher is wrapped with nohup:

$ nohup python -m torch.distributed.launch ...

Then theoretically the SIGHUP should’ve been ignored and not passed through to the launcher PID, but from the logs it’s clear that’s not what happened. FWIW here are the signals that are passed through to the worker PIDs from the elastic agent: SIGTERM, SIGINT, SIGHUP, SIGQUIT (for unix) and SIGTERM, SIGINT (for windows); SIGKILL cannot be caught, so it cannot be forwarded by a handler. See: https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/multiprocessing/api.py#L233-L237

Kiuk_Chung · March 11, 2022, 1:01am

So I was able to repro this with nohup. Currently none of torch.distributed.launch, torchrun, or torch.distributed.run work with nohup, since we register our own termination handler for SIGHUP, which overrides the ignore handler set by nohup. If you need to close the terminal but keep the process running, you can use tmux or screen.
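
A small experiment can illustrate the mechanism described here. nohup works by setting SIGHUP to "ignore" before executing the command, and the command keeps that setting only as long as it does not install its own SIGHUP handler, which is exactly what the launcher does. The sketch below uses sleep as a stand-in for a program that installs no handlers; the helper name hup_exit_status is hypothetical.

```shell
# Hypothetical helper: start a command in the background, send it SIGHUP,
# then SIGTERM, and print the exit status the shell observed.
#   129 (128+1)  -> the command was killed by SIGHUP
#   143 (128+15) -> the command ignored SIGHUP and was killed by SIGTERM
hup_exit_status() {
    "$@" >/dev/null 2>&1 &
    pid=$!
    sleep 1                              # give the child time to start
    kill -HUP "$pid" 2>/dev/null || true
    sleep 1
    kill -TERM "$pid" 2>/dev/null || true
    status=0
    wait "$pid" || status=$?
    echo "$status"
}

# Usage:
#   hup_exit_status nohup sleep 30   # sleep inherits the ignored SIGHUP
#   hup_exit_status sleep 30         # no nohup: SIGHUP normally terminates it
```

With nohup the helper reports 143, because SIGHUP had no effect; without nohup it typically reports 129. A program that registers its own SIGHUP handler, as the launcher does, replaces the inherited "ignore" setting and so reacts to SIGHUP even under nohup.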

kings-rgb · July 29, 2022, 2:47am

If you run a DDP program under nohup (via torch.distributed.launch, torchrun, or torch.distributed.run) from a terminal on the server, you need to leave the session with the exit command instead of closing the terminal directly. nohup has a small bug with multi-process programs.

Sunny_Sanyal · October 27, 2022, 7:51pm

I am also having the same issue. Any solutions for this?

yuchang · May 6, 2023, 12:32pm

Hi, I ran into the same problem. Do you have any solutions now?

duo_ma · July 11, 2023, 1:34am

This solution is effective. Thanks a lot.

Ke_Jiang · August 28, 2023, 8:37am

I have the same problem. Any ideas?

nahidalam · September 15, 2023, 5:03pm

For anyone looking into this in the future, this is how I was able to solve it with nohup:

nohup your_command > output.log 2>&1 &

For example, in this case:

nohup python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py ... > output.log 2>&1 &

This should print something like:

[1] 232423

Here 1 is the job ID and the big number is the PID. Now use the job ID with the disown command:

disown %1

This seems to be working for me so far.

drscotthawley · February 16, 2024, 10:47pm

As a regular user of nohup I was surprised to find this thread after receiving this error. I’ve never heard of another process, e.g. torch, taking over nohup’s no-hang-up functionality.

tmux and screen were suggested above, but without any guidance for us re. how to use them in a similar way to nohup. e.g., tmux creates a whole new “session” one has to manage somehow; this is not like nohup.

So, to whom it may concern, the following is the same bash function presented in original nohup form and new-improved tmux form – though I confess it was ChatGPT who provided the latter:

# for launching jobs
#   usage: launchrun <gpus> <logfile> <command and all args>
function launchrun() {
    CUDA_VISIBLE_DEVICES="$1" nohup "${@:3}" > "$2" 2>&1 &
}

# usage: launchrun <gpus> <logfile> <command and all args>
function launchrun() {
    gpus="$1"
    logfile="$2"
    shift 2  # Remove the first two arguments (gpus and logfile)

    # Generate a semi-unique session ID using date
    session_id="session_$(date +"%Y%m%d%H%M%S")"

    # Create a tmux session, run the command inside it, and detach
    tmux new-session -d -s "$session_id" "CUDA_VISIBLE_DEVICES=$gpus $* > $logfile 2>&1"

    # Optionally, you can also rename the window (replace "mywindow" with your desired name)
    # tmux rename-window -t "$session_id:0" mywindow

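    # No explicit detach is needed: -d above already leaves the session detached
    # (tmux detach-client would fail outside tmux, or detach your own client inside it)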

    echo "Session ID: $session_id"
}
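
Since screen was also suggested earlier in the thread, a screen counterpart could be sketched along the same lines. This is only an illustration under a few assumptions: the function name launchrun_screen and the DRY_RUN switch are made up for this example, and the command is passed to sh -c as a flattened string, so arguments containing spaces lose their quoting just as in the tmux version.

```shell
# usage: launchrun_screen <gpus> <logfile> <command and all args>
# Hypothetical screen counterpart of launchrun. Set DRY_RUN=1 to print the
# screen invocation for inspection instead of running it.
launchrun_screen() {
    gpus="$1"
    logfile="$2"
    shift 2  # Remove the first two arguments (gpus and logfile)

    # Semi-unique session name, one-second resolution
    session_id="session_$(date +%Y%m%d%H%M%S)"
    inner="CUDA_VISIBLE_DEVICES=$gpus $* > $logfile 2>&1"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf "screen -dmS %s sh -c '%s'\n" "$session_id" "$inner"
    else
        # -d -m starts the session detached; -S names it
        screen -dmS "$session_id" sh -c "$inner"
        echo "Session ID: $session_id"
    fi
}
```

A session started this way can be listed with screen -ls and reattached with screen -r followed by the session name; it ends on its own when the command finishes.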

I’d be interested in receiving comments or improvements to the above. Thanks!