I am using PyTorch distributed elastic for multi-GPU, multi-node runs. My training jobs are configured with a fixed number of restarts (`--max-restarts`). However, I want the ability to skip the remaining retries and fail early when the failure is non-retriable — for instance, when node-level validations fail, or when a known kind of exception is raised for which I don't want to retry. I couldn't find how elastic handles this. It seems to always restart on user errors (i.e., exceptions raised in the trainer subprocess).
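To make the desired semantics concrete, here is a minimal sketch of the retry behavior I am looking for. This is not a torchelastic API — `NonRetriableError` and `run_with_retries` are hypothetical names I made up for illustration; today the elastic agent appears to treat every worker failure the same way and consume a restart.

```python
class NonRetriableError(Exception):
    """Hypothetical marker for failures that should never be retried
    (e.g. failed node-level validation)."""


def run_with_retries(fn, max_restarts):
    """Call fn, restarting on ordinary failures up to max_restarts times.

    A NonRetriableError bypasses the remaining restart budget and is
    re-raised immediately -- the fail-fast behavior I want from elastic.
    """
    attempts = 0
    while True:
        try:
            return fn()
        except NonRetriableError:
            raise  # fail fast: do not consume the restart budget
        except Exception:
            attempts += 1
            if attempts > max_restarts:
                raise  # restart budget exhausted
```

In other words: transient exceptions from the trainer subprocess should still consume retries as they do now, but a designated class of errors should terminate the job on the first occurrence.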