Intel E810 RoCE NCCL unhandled system error

I applied the patch mentioned in the GitHub issue, and nccl-tests now passes across multiple nodes. I think the next step is to recompile PyTorch against the patched NCCL.
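After rebuilding, a quick sanity check is to ask PyTorch which NCCL version it reports. Here is a minimal sketch; the helper names `format_nccl_version` and `linked_nccl_version` are my own, and it assumes a CUDA build of PyTorch where `torch.cuda.nccl.version()` is available. The version number by itself does not prove the patch is present, it only helps confirm that PyTorch is not still using an older bundled NCCL.

```python
import importlib.util


def format_nccl_version(version):
    """Render a version tuple such as (2, 18, 3) as '2.18.3'."""
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version)
    # Some older PyTorch releases report a single integer instead of a tuple.
    return str(version)


def linked_nccl_version():
    """Return the NCCL version PyTorch reports, or None if it cannot be read."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    try:
        import torch
        import torch.cuda.nccl

        return torch.cuda.nccl.version()
    except (ImportError, AttributeError, RuntimeError):
        # For example a CPU-only build, which has no NCCL bindings.
        return None


if __name__ == "__main__":
    version = linked_nccl_version()
    if version is None:
        print("Could not read an NCCL version from PyTorch")
    else:
        print("PyTorch reports NCCL", format_nccl_version(version))
```

Running a job with `NCCL_DEBUG=INFO` set also makes NCCL log its version at startup, which can be compared against the patched build.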
