I installed PyTorch with conda, and I can use the NCCL backend directly for distributed training. However, the NCCL library bundled with PyTorch is version 2.4.8. If I want to use a different, manually installed NCCL library, such as version 2.7.8, how can I do it? Is there any way to do this without compiling PyTorch from source?
Run export USE_SYSTEM_NCCL=1, and then compile PyTorch from source so that the build picks up the system NCCL instead of the bundled one.
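To confirm which NCCL version a given PyTorch build actually reports (for example, after rebuilding with USE_SYSTEM_NCCL=1), a minimal sketch like the following may help. It relies on torch.cuda.nccl.version(), which returns an integer code in older releases and a tuple in newer ones. The helper names format_nccl_version and report_torch_nccl are made up for this illustration, and the decoding of the integer code assumes NCCL's usual version-code scheme.

```python
def format_nccl_version(version):
    """Turn the value of torch.cuda.nccl.version() into 'major.minor.patch'.

    Hypothetical helper: accepts either a tuple such as (2, 7, 8) or an
    integer code such as 2708.
    """
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version[:3])
    # Assumption: older NCCL releases encode the version as
    # major*1000 + minor*100 + patch, newer ones as major*10000 + minor*100 + patch.
    if version < 10000:
        major, rest = divmod(version, 1000)
    else:
        major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def report_torch_nccl():
    """Return the NCCL version string reported by PyTorch, or None if unavailable."""
    try:
        import torch
        return format_nccl_version(torch.cuda.nccl.version())
    except (ImportError, AttributeError, RuntimeError):
        # PyTorch is not installed, or this build has no NCCL support.
        return None


if __name__ == "__main__":
    print("NCCL version reported by PyTorch:", report_torch_nccl())
```

With the conda build described in the question, this would be expected to report 2.4.8. After a source build against a system NCCL, it should report the system version instead.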
See this discussion: Torch distributed not working on two machines [nccl backend]
Thank you. Is there any way to do it without compiling PyTorch from source?
I don’t think there is an easy or safe way to do so, because the NCCL API also changes from release to release. Even if you could dynamically link a different libnccl, it might not be compatible with the libtorch that was already built against the bundled version.
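To at least see which version a manually installed libnccl reports, it can be queried directly through ctypes. This is only a sketch: it assumes the shared library is loadable under the name libnccl.so.2 (adjust the name or path for your installation), and the function name system_nccl_version_code is hypothetical. It calls NCCL's ncclGetVersion, which writes an integer version code such as 2708 for 2.7.8. It does not make PyTorch use that library, and matching version numbers would not by themselves guarantee compatibility.

```python
import ctypes


def system_nccl_version_code(lib_name="libnccl.so.2"):
    """Return the integer version code of a system libnccl, or None on failure.

    lib_name is an assumption; pass the full path if the library is not on
    the default loader search path.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # Library not found or not loadable on this machine.
        return None

    # ncclResult_t ncclGetVersion(int* version); ncclSuccess is 0.
    lib.ncclGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.ncclGetVersion.restype = ctypes.c_int

    code = ctypes.c_int(0)
    status = lib.ncclGetVersion(ctypes.byref(code))
    if status != 0:
        return None
    return code.value


if __name__ == "__main__":
    print("System NCCL version code:", system_nccl_version_code())
```

Comparing this value with the version PyTorch reports shows how far apart the two libraries are, which is the gap that a source build with USE_SYSTEM_NCCL=1 is meant to close.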