I’m running an elastic training job with 4 workers. I have noticed that when one worker fails and causes all workers to restart, all workers restart in process, but only the worker that failed has
TORCHELASTIC_RESTART_COUNT incremented upon restart, while the other workers still have the restart count at 0.
I use this count to decide when to load previous checkpoints. Is this count reliable enough for that in multi-worker jobs? If not, what would be a better way to detect such restarts?
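
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual training code: the helper name `should_load_checkpoint` and its `environ` parameter are hypothetical, and it assumes the launcher exports TORCHELASTIC_RESTART_COUNT as an integer string in each worker's environment.

```python
import os


def should_load_checkpoint(environ=os.environ):
    """Return True if this worker appears to have been restarted.

    Hypothetical helper: reads TORCHELASTIC_RESTART_COUNT from the given
    environment mapping and treats a missing variable as 0 restarts.
    """
    restart_count = int(environ.get("TORCHELASTIC_RESTART_COUNT", "0"))
    return restart_count > 0


if __name__ == "__main__":
    if should_load_checkpoint():
        print("restart detected, would load the previous checkpoint here")
    else:
        print("first launch, starting from scratch")
```

With the behavior described above, a check like this would return True only on the worker that failed, which is why the other three workers would skip checkpoint loading.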