This probably indicates a CUDA/NCCL deadlock caused these timeouts. There are a few ways to debug this:
- Set the environment variable `NCCL_DEBUG=INFO`; this will print NCCL debugging information.
- Set the environment variable `TORCH_DISTRIBUTED_DEBUG=DETAIL`; this adds significant overhead but will give you an exact error if there are mismatched collectives.
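As a minimal sketch, these variables would be exported before launching the job; the `torchrun` invocation and `train.py` below are placeholders for whatever launcher and training script you actually use:

```shell
# Print NCCL debugging information (topology setup, collective calls, errors)
export NCCL_DEBUG=INFO

# Enable PyTorch's most verbose distributed debug mode; adds significant
# overhead, but reports an exact error on mismatched collectives
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Then launch as usual, e.g. (placeholder command):
# torchrun --nproc_per_node=4 train.py
```

The NCCL output goes to stderr on each rank, so it helps to capture per-rank logs when comparing where each process stalled.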