If I raise a SystemExit, only the process that raised it exits, while the rest wait indefinitely.
Processes participating in a distributed data parallel job communicate with each other using collective communication calls (e.g. all_reduce, all_gather). If one of your processes fails, those calls will block the rest of your processes. Fault tolerance and job retries are not part of the DDP framework. I would suggest checking TorchElastic, and optionally Slurm and other OSS job schedulers for what you want to achieve.
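To make the failure mode concrete, here is a minimal sketch of the supervision idea that DDP leaves to an external launcher: watch every worker and tear the whole job down as soon as one rank fails, so the survivors do not hang forever. It uses only the Python standard library and is not how TorchElastic is implemented. The `WORKER` script and the `run_job` helper are hypothetical names, and the long `time.sleep` is only a stand-in for a collective call such as all_reduce that never returns.

```python
import subprocess
import sys
import time

# Hypothetical worker body: rank 1 fails, the other ranks block
# as if they were stuck inside a collective call.
WORKER = """
import sys, time
rank = int(sys.argv[1])
if rank == 1:
    raise SystemExit(1)   # simulated failure on a single rank
time.sleep(3600)          # stand-in for a collective that never returns
"""


def run_job(world_size=4, poll_interval=0.1):
    """Start the workers; if any rank fails, kill the rest of the job.

    Returns (failed_rank, exit_code), or (None, 0) if every rank succeeds.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, str(rank)])
        for rank in range(world_size)
    ]
    try:
        while True:
            codes = [p.poll() for p in procs]  # None means still running
            failed = [r for r, c in enumerate(codes) if c not in (None, 0)]
            if failed:
                return failed[0], codes[failed[0]]
            if all(c == 0 for c in codes):
                return None, 0
            time.sleep(poll_interval)
    finally:
        # Without this cleanup the surviving ranks would keep waiting.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()


if __name__ == "__main__":
    print(run_job())
```

In a real job you would not write this yourself. A launcher or scheduler plays the supervisor role and can also restart the workers, which this sketch does not attempt.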
If one process fails, it will exit. For example, if I raise a SystemError, the whole process exits as expected.