How to Exit in DDP

zyc · August 17, 2021, 3:50pm

If I raise SystemExit in one process, only that process exits, while the rest wait indefinitely.

cbalioglu · August 17, 2021, 4:27pm

Processes participating in a distributed data parallel job communicate with each other using collective communication calls (e.g. all_reduce, all_gather). If one of your processes fails, those calls will block the rest of your processes. Fault tolerance and job retries are not part of the DDP framework. I would suggest checking TorchElastic, and optionally Slurm and other OSS job schedulers for what you want to achieve.
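If you only need a clean, voluntary exit rather than real fault tolerance, one common pattern is to have every rank agree on a stop flag through a collective before anyone leaves the loop. The sketch below is a minimal illustration, not part of DDP itself: `any_rank_wants_to_stop`, `train_loop`, and the `wants_stop` callback are hypothetical names, and it assumes a process group has already been initialized (CPU tensors with the gloo backend by default; pass a CUDA device when using NCCL).

```python
import torch
import torch.distributed as dist


def any_rank_wants_to_stop(local_stop: bool, device: str = "cpu") -> bool:
    """Return True on every rank if at least one rank passed local_stop=True.

    Hypothetical helper. Every rank must call it at the same point in the
    loop, because all_reduce is itself a collective.
    """
    flag = torch.tensor([1 if local_stop else 0], dtype=torch.int32, device=device)
    # MAX across ranks: the result is 1 if any rank contributed a 1.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())


def train_loop(num_steps: int, wants_stop) -> int:
    """Run up to num_steps steps; return how many completed.

    wants_stop(step) -> bool stands in for whatever per-rank exit
    condition you have (an assumption for this sketch).
    """
    completed = 0
    for step in range(num_steps):
        # All ranks learn the same answer, so they all break together
        # instead of one rank exiting and the others blocking.
        if any_rank_wants_to_stop(wants_stop(step)):
            break
        # ... forward / backward / optimizer step would go here ...
        completed += 1
    return completed
```

After the loop, each rank can call `dist.destroy_process_group()` and return normally. This does not help when a process crashes unexpectedly, since the crashed rank never reaches the collective. That case still needs a launcher or scheduler that tears down the job.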

zyc · August 24, 2021, 5:49am

If one process fails, it does exit. For example, if I raise a SystemError, the whole job exits as expected.