I installed NCCL 2 (downloaded from https://developer.nvidia.com/nccl) as follows:
- sudo dpkg -i nccl-repo-XXXX.deb
- sudo apt update
- sudo apt install libnccl2 libnccl-dev
Then I built PyTorch from source. The build seems to be fine: I ran pytorch/build/bin/ProcessGroupNCCLTest and it printed
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
But when I run the following in Python:
import torch.distributed as dist
print(dist.is_nccl_available())
The output is False, so I cannot use the NCCL backend. What could be the cause?
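In case it helps narrow things down, here is a slightly fuller check I can run, written as a minimal sketch. The helper name collect_nccl_diagnostics is made up for illustration; it relies only on torch.__version__, torch.__file__, torch.cuda.is_available(), torch.distributed.is_available() and torch.distributed.is_nccl_available(). I am assuming that torch.__file__ will at least show whether Python is importing the build I just compiled rather than some other installed copy.

```python
import importlib.util


def collect_nccl_diagnostics():
    """Hypothetical helper: gather facts related to is_nccl_available()."""
    # Check for the package first so the script still runs without torch.
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch
    import torch.distributed as dist

    info["torch_version"] = torch.__version__
    # Path of the imported package: shows which installation Python picked up.
    info["torch_path"] = torch.__file__
    info["cuda_available"] = torch.cuda.is_available()
    info["distributed_available"] = dist.is_available()
    # Only query NCCL when the distributed package itself was built.
    info["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    return info


if __name__ == "__main__":
    for key, value in collect_nccl_diagnostics().items():
        print(f"{key}: {value}")
```

I can post the output of this script if that is useful.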
My full build log is at https://s3-us-west-2.amazonaws.com/deepingsource-temp-outgoing/build_log.txt
Thank you.