I know that I can dynamically add a worker to the worker group by executing a torchrun command. Is there a similar command to terminate an existing worker (at my own initiative) without affecting the other ongoing worker processes?
TorchX / TorchElastic team may have more context on this issue - cc @d4l3k
Torchrun manages worker processes on a host. If you want to terminate a single torchrun worker, or all of the workers on a host, you can use
kill ... with the PID. There’s no torch-specific command for killing them.
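Concretely, that looks something like the sketch below. The `sleep` process here just stands in for a worker; in practice you would first look up the real PID of the torchrun worker you want to stop (for example with `pgrep -af torchrun`, though the exact pattern depends on your setup):

```shell
# Start a stand-in long-running "worker" in the background. Substitute the
# PID of an actual torchrun worker process in a real job.
sleep 60 &
WORKER_PID=$!

# Send plain SIGTERM first; escalate to `kill -9` only if the process hangs.
kill "$WORKER_PID"

# Reap the process and check its exit status: 143 = 128 + 15 (SIGTERM),
# confirming it was terminated by the signal.
wait "$WORKER_PID"
echo $?
```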
Notably though, since the workers are all linked, if one worker dies all of the others have to re-rendezvous with each other (typically reloading from a checkpoint), so it isn’t really possible to do this “without influencing other ongoing worker processes”. This is an unfortunate side effect of NCCL not having any fault tolerance.
What’s the use case here for killing a single worker? If I know more, I might be able to give a better suggestion.
Thanks for your response. I am working on scheduling DDP jobs, and in some cases a DDP job may need to shrink or expand its number of workers. I know that I can expand the number of workers by executing commands on a host, but I haven’t found a PyTorch way to ‘shrink’ the worker count. So, following your suggestion, I just need to kill some processes on the host and the remaining processes will automatically reorganize properly (e.g., # of workers: 8 -> 4), right? Or is there a better solution to achieve my goal?
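For this shrink/expand use case, torchrun’s elastic launch (`--nnodes=MIN:MAX`) may be a better fit than killing processes by hand: when membership changes, the surviving workers re-rendezvous and the job continues with however many nodes are currently available, resuming from the last checkpoint your training script saves and loads. A hedged sketch of the launch command (the endpoint, rendezvous id, and script name are placeholders, not values from this thread):

```shell
# Elastic launch: the job tolerates between 1 and 4 nodes. When a node
# joins or leaves, remaining workers re-rendezvous automatically.
torchrun \
  --nnodes=1:4 \
  --nproc-per-node=8 \
  --max-restarts=3 \
  --rdzv-backend=c10d \
  --rdzv-endpoint=$HOST_NODE:29400 \
  --rdzv-id=my_ddp_job \
  train.py
```

Your `train.py` still needs to checkpoint regularly, since each re-rendezvous restarts the workers and training resumes from the last saved state.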