How to Exit in DDP

If I raise SystemExit in one process, only that process exits, while the rest wait indefinitely.

Processes participating in a distributed data parallel job communicate with each other using collective communication calls (e.g., all_reduce, all_gather). If one of your processes fails or exits, the remaining processes block inside those calls, waiting for a peer that will never join. Fault tolerance and job retries are not part of the DDP framework. I would suggest checking TorchElastic, and optionally Slurm and other OSS job schedulers, for what you want to achieve.
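If the exit condition is something a rank can detect between steps (rather than a crash in the middle of a collective), one common workaround is to have all ranks agree on a stop flag before each step, so they leave the loop together. Below is a minimal sketch of that idea, not an official DDP feature. The helper names `any_rank_wants_exit`, `run_steps`, `step_fn`, and `should_stop_fn` are hypothetical, and it assumes every rank iterates over the same number of batches and that `device` matches your backend (e.g. a CUDA device for NCCL).

```python
try:
    import torch
    import torch.distributed as dist
except ImportError:  # lets the sketch be imported on a machine without PyTorch
    torch = dist = None


def any_rank_wants_exit(local_flag: bool, device: str = "cpu") -> bool:
    """Return True on every rank if at least one rank passed local_flag=True."""
    # Without an initialized process group there are no peers to agree with.
    if dist is None or not (dist.is_available() and dist.is_initialized()):
        return local_flag
    flag = torch.tensor([1 if local_flag else 0], dtype=torch.int32, device=device)
    # MAX over all ranks: the result is 1 if any rank contributed 1.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())


def run_steps(batches, step_fn, should_stop_fn, device: str = "cpu") -> int:
    """Run step_fn per batch until any rank asks to stop; return steps completed."""
    done = 0
    for batch in batches:
        # Every rank makes this call each iteration, so nobody is left waiting.
        if any_rank_wants_exit(should_stop_fn(batch), device):
            break
        step_fn(batch)
        done += 1
    return done
```

Once the loop ends on every rank, each process can clean up and exit normally. This does not help if a rank dies inside the forward/backward pass itself; for that case you still need an external launcher or scheduler to tear down and restart the job.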

If one process fails, it will exit. For example, if I raise a SystemError, the whole process exits as expected.
