Kill job if exception raised during NCCL AllReduce

bshama · March 8, 2024, 4:13am

Hi!

I’m currently training with PyTorch distributed with NCCL_ASYNC_ERROR_HANDLING enabled, and I often run into an issue where one of the training ranks throws an exception but the job doesn’t fail until our configured NCCL timeout has been reached. I’d like to kill all ranks/processes immediately when any rank throws an exception. Is there a good way to accomplish this?

Thanks!

bshama · March 11, 2024, 6:49pm

Resolved this by changing the way we kick off our torch jobs and changing our k8s jobs to kill the job if a single pod fails.
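
For anyone landing here later, this is a rough sketch of the fail-fast idea on the launcher side, not our exact setup. It assumes each rank runs as its own child process of a small Python launcher. The name `run_fail_fast` and its parameters are made up for illustration. The launcher polls the children and, as soon as one exits non-zero, terminates the rest and returns that exit code instead of waiting for the surviving ranks to hit the NCCL timeout.

```python
import subprocess
import sys
import time


def run_fail_fast(commands, poll_interval=0.1, grace_period=5.0):
    """Start one process per command; stop all of them if any one fails.

    Returns 0 if every process exits cleanly, otherwise the first
    non-zero exit code observed. (Hypothetical helper, for illustration.)
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            # A return code that is neither None (still running) nor 0 is a failure.
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        # Ask any still-running process to stop, then force-kill stragglers.
        for p in procs:
            if p.poll() is None:
                p.terminate()
        for p in procs:
            try:
                p.wait(timeout=grace_period)
            except subprocess.TimeoutExpired:
                p.kill()
                p.wait()


if __name__ == "__main__":
    # Stand-ins for real training ranks: one hangs, one fails right away.
    hanging_rank = [sys.executable, "-c", "import time; time.sleep(60)"]
    failing_rank = [sys.executable, "-c", "import sys; sys.exit(3)"]
    sys.exit(run_fail_fast([hanging_rank, failing_rank]))
```

If the launcher is the pod's main process, its non-zero exit code is what marks the pod as failed, and the Kubernetes side can then be configured to tear down the whole job when a single pod fails.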