I applied the patch mentioned in the GitHub issue, and nccl-test passed on multiple nodes. I think the next step is to recompile PyTorch against the patched NCCL.
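After rebuilding, it may be worth checking which NCCL version PyTorch actually reports, to confirm the build picked up the patched library and not a bundled copy. Here is a minimal sketch, assuming `torch.cuda.nccl.version()` is available in your build; the helper names `format_nccl_version` and `report_nccl_version` are just made up for illustration:

```python
import importlib.util


def format_nccl_version(version):
    """Format an NCCL version as a string.

    Recent PyTorch builds return a tuple such as (2, 18, 3);
    older builds may return a single integer, which is passed through as-is.
    """
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version)
    return str(version)


def report_nccl_version():
    """Return the NCCL version PyTorch reports, or None if it cannot be read."""
    # Skip cleanly when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return None
    try:
        import torch.cuda.nccl as nccl
        return format_nccl_version(nccl.version())
    except Exception:
        # Builds without NCCL support may fail here; treat that as "unknown".
        return None


if __name__ == "__main__":
    print("NCCL version reported by PyTorch:", report_nccl_version())
```

Note that a patch does not necessarily change the version number, so a matching version only tells you which NCCL release PyTorch was built against. Rerunning the multi-node job is still the real check.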