How to kill entire training job if one GPU fails when using DDP


I’m using DDP to train on multiple GPUs on a shared server. Occasionally one GPU process runs out of memory in the dataloader and errors out, but the other GPU processes keep going. Eventually the entire job freezes without erroring out, racking up large server costs while making no progress. Is there any way to tell DDP to kill all GPU processes if one fails?


Hey @EricWiener,

TorchElastic is designed to recover from such errors, see: Torch Distributed Elastic — PyTorch 1.9.0 documentation
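For reference, a minimal launch sketch using the elastic launcher that ships with PyTorch 1.9 (`torch.distributed.run`); `train.py` is a hypothetical training script, and with `--max_restarts=0` the agent tears down all workers as soon as any one of them fails instead of restarting the group:

```shell
# Hypothetical single-node launch with 4 workers.
# If any worker fails, the elastic agent stops the whole worker group;
# --max_restarts=0 means "fail the job" rather than "restart everyone".
python -m torch.distributed.run \
    --nnodes=1 \
    --nproc_per_node=4 \
    --max_restarts=0 \
    train.py
```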

If you just need the non-failing processes to crash when any peer hits OOM, you can set a timeout when calling init_process_group, and set the NCCL_ASYNC_ERROR_HANDLING env var when using the NCCL backend. See the timeout argument docstring for init_process_group at Distributed communication package - torch.distributed — PyTorch master documentation
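A minimal sketch of that setup. So it can run on a CPU-only machine, this uses a single-process gloo group; in your real DDP job you would use backend="nccl" with your actual rank/world_size, and NCCL_ASYNC_ERROR_HANDLING must be set before the process group is created:

```python
import os
from datetime import timedelta

# Opt in to async error handling so a failed/aborted collective raises an
# exception in the surviving ranks instead of hanging them (NCCL backend only).
# Must be set before init_process_group.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Rendezvous info normally provided by the launcher; set here so the
# single-process example is self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

import torch.distributed as dist

# The timeout bounds how long a rank waits on a collective before erroring
# out; in a real job use backend="nccl" and your launcher-assigned rank.
dist.init_process_group(
    backend="gloo",
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=10),
)

print(dist.is_initialized())

dist.destroy_process_group()
```

With NCCL, a rank that OOMs dies, its peers' collectives time out or abort, and the raised exception crashes the remaining processes rather than leaving the job frozen.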