DDP Error: torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers

I was able to reproduce this with nohup. Currently, none of torch.distributed.launch, torchrun, or torch.distributed.run work with nohup, because we register our own termination handler for SIGHUP, and that handler overrides the ignore disposition nohup sets up. When the terminal closes, the launcher therefore receives SIGHUP and shuts down the workers. If you need to close the terminal but keep the process running, use tmux or screen instead.
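To illustrate the mechanism, here is a minimal sketch in plain Python (Unix only). It is not the actual launcher code: `simulate_nohup`, `install_termination_handler`, and `sighup_is_ignored` are hypothetical names made up for this example. It only shows that once a process registers its own SIGHUP handler, the "ignore" setting inherited from nohup is gone.

```python
import os
import signal


def simulate_nohup():
    """Mimic what nohup does before running the command: ignore SIGHUP."""
    signal.signal(signal.SIGHUP, signal.SIG_IGN)


def sighup_is_ignored():
    """Return True if SIGHUP is currently set to be ignored."""
    return signal.getsignal(signal.SIGHUP) == signal.SIG_IGN


def install_termination_handler(received):
    """Hypothetical stand-in for a launcher registering its own handler.

    A real launcher would shut down its workers here; this sketch only
    records the signal number so the effect can be observed.
    """
    def _handler(signum, frame):
        received.append(signum)

    signal.signal(signal.SIGHUP, _handler)


if __name__ == "__main__":
    simulate_nohup()
    print("ignored after nohup-style setup:", sighup_is_ignored())

    received = []
    install_termination_handler(received)
    print("ignored after handler registration:", sighup_is_ignored())

    # Send SIGHUP to ourselves, as the terminal would on hangup.
    os.kill(os.getpid(), signal.SIGHUP)
    print("signals seen by the handler:", received)
```

With tmux or screen the session stays alive after you detach, so the launcher is not sent SIGHUP when you close your terminal window.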
