Distributed data parallel freezes without error message

There might be an NCCL deadlock happening in the distributed setting (which is why you saw a freeze). We identified this last week, and I am issuing fixes for it.
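In the meantime, if you want to see where each process is stuck, here is a minimal sketch (not part of the fix itself). It assumes you can call a helper at the top of your training script, before the process group is initialized. `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical names I made up for illustration; `NCCL_DEBUG` is NCCL's own logging variable and `faulthandler` is from the Python standard library.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(dump_after_s=600.0, file=sys.stderr):
    """Hypothetical helper: call once per process, before NCCL is initialized."""
    # NCCL reads NCCL_DEBUG at initialization; INFO makes it log its setup
    # and any errors. setdefault keeps a value the user already exported.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    # If the process is still running after dump_after_s seconds, write the
    # Python stack of every thread to `file`, and repeat at that interval.
    # A frozen rank will keep reporting the same line it is blocked on.
    faulthandler.dump_traceback_later(dump_after_s, repeat=True, file=file)


def disable_hang_diagnostics():
    """Cancel the pending traceback dump, e.g. after training finishes."""
    faulthandler.cancel_dump_traceback_later()
```

Pick a `dump_after_s` that is longer than a normal iteration, otherwise healthy runs will also print tracebacks. This only shows the Python-side location of the hang; it does not resolve the deadlock.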
