Handling signals in a distributed training loop

I am going to train a model with several combinations of hyperparameters (a hyperparameter search) and would like to terminate training runs that I consider unsuccessful, in order to save time.

I am considering sending a Unix signal to do this. However, it seems that torchrun intercepts all signals and converts them to SIGTERM.

My code looks like the following:

import os
import signal
import torch

signal_received = False

def usr_signal_handler(signum, frame):
    global signal_received
    signal_received = True
    print('USR1 signal received')


signal.signal(signal.SIGUSR1, usr_signal_handler)

torch.distributed.init_process_group()

for _ in range(10):
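    # rank 0 draws a new hyperparameter combination and broadcasts it to the other ranks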
    if int(os.environ['RANK']) == 0:
        p1, p2, p3 = generate_hyperparameters()
        torch.distributed.broadcast_object_list([p1, p2, p3], src=0)
    else:
        v = [None, None, None]
        torch.distributed.broadcast_object_list(v, src=0)
        p1, p2, p3 = v

    model = get_model(p1, p2, p3)

    for i in range(n_epochs):
        if signal_received:
            break
        dataloader = get_dataloader()
        for inputs, labels in dataloader:
            if signal_received:
                break
            outputs = model(inputs)
            loss = loss_func(outputs, labels)
            ....

torch.distributed.destroy_process_group()

Then I run this code:

$ torchrun --nproc-per-node 2 --nnodes 1 --node-rank 0 --master-addr 192.168.1.123 --master-port 37176 train.py --other --parameters

Then I look at the output of ps xf, find the PIDs of the processes launched by torchrun, and send the USR1 signal to the first one (I assume this process was assigned rank 0):

$ kill -USR1 1048139

Result: all processes are killed with SIGTERM.

W0220 11:42:40.321000 1048055 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1048139 closing signal SIGTERM
W0220 11:42:40.324000 1048055 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1048140 closing signal SIGTERM
E0220 11:42:43.274000 1048055 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -10) local_rank: 0 (pid: 1048139) of binary: /home/user/.envs/env/bin/python3.11
Traceback (most recent call last):
  File "/home/user/.envs/env/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/.envs/env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/.envs/env/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/user/.envs/env/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/user/.envs/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.envs/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
train.py FAILED
---------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2025-02-20_11:42:40
host      : hostname
rank      : 0 (local_rank: 0)
exitcode  : -10 (pid: 1048139)
error_file: <N/A>
traceback : Signal 10 (SIGUSR1) received by PID 1048139
=========================================================

Is there any way to avoid this?
Studying the sources of the torch.distributed API revealed that it installs handlers for SIGTERM, SIGINT, SIGHUP and SIGQUIT (which is why I decided to use SIGUSR1).

I should add that I have used exactly this approach in single-GPU training, and it worked perfectly.
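A side note on the multi-GPU case: only the rank that actually receives SIGUSR1 sets the flag, so the other ranks can end up waiting at the next collective. A minimal sketch of how the flag could be propagated, assuming the process group is already initialized; should_stop is just an illustrative helper, not part of my script:

import torch
import torch.distributed as dist

def should_stop(signal_received: bool, device: torch.device) -> bool:
    # Every rank contributes 1 if it saw the signal, 0 otherwise; MAX means
    # "stop everywhere as soon as any single rank was signaled".
    flag = torch.tensor([1 if signal_received else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

Checking should_stop once per epoch (or every few steps) would let all ranks leave the loop together; with the nccl backend the flag tensor has to live on the rank's GPU.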

Hey,
I have exactly the same issue. Did you find a solution?
I work in a SLURM environment with a job time limit of 24 h. Usually you use an automatic resubmission script that relies on the USR1 signal to resubmit the training job. torchrun intercepting this signal is quite problematic for me.
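To make the context concrete, the resubmission pattern I mean looks roughly like the sketch below. The lead time (300 s), the script layout, and the trap-based forwarding are assumptions for illustration, not a tested recipe:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@300   # ask SLURM to send SIGUSR1 to the batch shell 300 s before the limit

# Hypothetical handler: forward the signal to the training process, then resubmit this script.
resubmit() {
    kill -USR1 "$TRAIN_PID"
    sbatch "$0"
}
trap resubmit USR1

torchrun --nproc-per-node 2 train.py &   # note: $! is the torchrun PID here,
TRAIN_PID=$!                             # so forwarding to it hits exactly the interception problem
wait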

No. Unfortunately I haven’t found any solution.

I think I have actually found the solution. torchrun spawns the actual Python script as a child process. If you send the signal to the child process instead of the torchrun process, it is not intercepted and works as intended.

Assuming you have the PID of the torchrun process in $PID, you can get the PID of the first Python child process with:
PID=$(ps --ppid "$PID" -o pid= --sort=start_time | head -n 1)

This seems to work for me at the moment. I don’t know if this method has any other disadvantages.
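One thing to keep in mind: the handler in the script above only breaks the loop on the rank that actually receives the signal, so with several workers per node you may want to signal all of them. A possible sketch, assuming 1048055 (taken from the log above) is the PID of the torchrun launcher:

TORCHRUN_PID=1048055                     # example: the launcher PID from the log above
for WORKER_PID in $(pgrep -P "$TORCHRUN_PID"); do
    kill -USR1 "$WORKER_PID"             # signal each spawned worker directly
done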

Best,
Karol

I have done exactly that. Here is the quote from my original post:

"Then I look at the output of ps xf, find the PIDs of the processes launched by torchrun, and send the USR1 signal to the first one (I assume this process was assigned rank 0)."

I haven't stated it explicitly, but those are the child processes spawned by torchrun, so I have already tried sending the signal to the child process.