RuntimeError: NCCL communicator was aborted

This usually indicates a CUDA/NCCL deadlock that caused collective operations to time out. There are a few ways to debug this:

  1. Set the environment variable NCCL_DEBUG=INFO. This makes NCCL print debugging information (initialization, topology, and collective activity) to the logs.
  2. Set the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL. This adds significant runtime overhead, but will report an exact error if ranks issue mismatched collectives.