I’m currently training with PyTorch distributed with NCCL_ASYNC_ERROR_HANDLING set, and I often run into an issue where one of the training ranks throws an exception but the job doesn’t fail until our configured NCCL timeout has been reached. I’d like to kill all ranks/processes immediately when any one rank throws an exception. Is there a good way to accomplish this?
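
To make the desired behavior concrete, here is a minimal sketch of the "fail fast" pattern using only the Python standard library. It is not PyTorch- or NCCL-specific: the worker script, `run_fail_fast`, and its parameters are hypothetical names for illustration, and the 60-second sleep merely stands in for healthy ranks blocked in a collective. A parent process polls the ranks and kills the rest as soon as any one exits with a non-zero code, rather than waiting for a timeout.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a training rank: the failing rank raises,
# every other rank blocks as if stuck waiting in a collective.
WORKER_SRC = """
import sys, time
rank, failing_rank = int(sys.argv[1]), int(sys.argv[2])
if rank == failing_rank:
    raise RuntimeError(f"rank {rank} failed")
time.sleep(60)
"""


def run_fail_fast(world_size, failing_rank, poll_s=0.1):
    """Launch world_size ranks; kill all of them once any rank fails.

    Returns the exit code of the first failed rank found, or 0 if all
    ranks exit cleanly.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SRC, str(rank), str(failing_rank)],
            stderr=subprocess.DEVNULL,  # hide the simulated traceback
        )
        for rank in range(world_size)
    ]
    exit_code = 0
    try:
        while True:
            codes = [p.poll() for p in procs]  # None means still running
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                exit_code = failed[0]
                break
            if all(c == 0 for c in codes):
                break
            time.sleep(poll_s)
    finally:
        # Tear down every rank that is still alive, then reap them all.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
    return exit_code
```

Calling `run_fail_fast(3, 1)` should return shortly after rank 1 raises, instead of after the 60-second sleep. What I'm looking for is the equivalent of this for real ranks that are blocked inside NCCL collectives.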