Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels

Did you solve it? I ran into the same problem, and I'm confused about why NumelIn=1, NumelOut=1.