Is there any way to disable the fault tolerance in Torchrun for debugging purposes?
As shown in Fault-tolerant Distributed Training with torchrun — PyTorch Tutorials 2.0.1+cu117 documentation, the current behavior restarts the entire process group after a crash, so I cannot find the real cause of the failure once it happens.
I’m not sure, but can you try --max_restarts=1 to see if that surfaces the crash earlier? (IIUC you want the system to crash fast, not attempt to restart?)
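For reference, a minimal sketch of a launch command with restarts limited; the script name train.py and the worker count are placeholders, and I believe --max_restarts=0 disables restarts entirely so the job fails fast on the first worker crash:

```shell
# Placeholder launch: 2 workers on one node, no elastic restarts.
# With --max_restarts=0 the agent does not re-spawn the worker group,
# so the original stack trace from the crashed worker is preserved.
torchrun --nproc_per_node=2 --max_restarts=0 train.py
```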
It seems that torch.elastic resets the NCCL context as well. Is there any way to avoid restarting NCCL (i.e., torch.distributed)? Is there any way to let torch.distributed retry only the last failed API call (e.g., allreduce) instead of restarting the whole process group?
Yes, I want the system to just crash and not attempt to restart.