This probably indicates a CUDA/NCCL deadlock caused these timeouts. There are a few ways to debug this:
- Set the environment variable `NCCL_DEBUG=INFO`; this will print NCCL debugging information.
- Set the environment variable `TORCH_DISTRIBUTED_DEBUG=DETAIL`; this adds significant overhead but will give you an exact error if there are mismatched collectives.
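As a minimal sketch, these variables would be exported before launching the job; the `torchrun` invocation and `train.py` below are placeholders for whatever launcher and training script you actually use:

```shell
# Print NCCL debugging information (topology setup, collective calls, errors)
export NCCL_DEBUG=INFO

# Enable PyTorch's most verbose distributed debug mode; adds significant
# overhead, but reports an exact error on mismatched collectives
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Then launch as usual, e.g. (placeholder command):
# torchrun --nproc_per_node=4 train.py
```

The NCCL output goes to stderr on each rank, so it helps to capture per-rank logs when comparing where each process stalled.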