Is there any way to disable the fault tolerance in Torchrun for debugging purposes?
As shown in Fault-tolerant Distributed Training with torchrun — PyTorch Tutorials 2.0.1+cu117 documentation, the current behavior restarts the entire process group after a crash, so I cannot find the real cause of the failure once it happens.
I’m not sure, but can you try --max_restarts=1 to see if that surfaces the crash earlier? (IIUC you want the system to crash fast, not attempt to restart?)
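For reference, a minimal sketch of a launch command with restarts limited; the script name train.py and the worker count are placeholders, and I believe --max_restarts=0 disables restarts entirely so the job fails fast on the first worker crash:

```shell
# Placeholder launch: 2 workers on one node, no elastic restarts.
# With --max_restarts=0 the agent does not re-spawn the worker group,
# so the original stack trace from the crashed worker is preserved.
torchrun --nproc_per_node=2 --max_restarts=0 train.py
```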
It seems that torch.elastic resets the NCCL context as well. Is there any way to avoid restarting NCCL (i.e., torch.distributed)? Is there any way to let torch.distributed retry only the last failed API call (e.g., allreduce) instead of restarting the whole process group?
Yes, I want the system to just crash and not attempt to restart.