How can I use the NCCL backend with torch.distributed?

I installed nccl2 (downloaded from https://developer.nvidia.com/nccl) as follows:

  • sudo dpkg -i nccl-repo-XXXX.deb
  • sudo apt update
  • sudo apt install libnccl2 libnccl-dev

Then I built PyTorch from source. The build seemed fine: I ran pytorch/build/bin/ProcessGroupNCCLTest and it printed

Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful

But when I do

import torch.distributed as dist
print(dist.is_nccl_available())

it prints False, and I cannot use the NCCL backend. What could be the cause?
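
For reference, a slightly fuller version of the check above can be sketched as follows. It is only a minimal sketch: the function name nccl_diagnostics and the dictionary keys are made up for illustration, and it assumes a PyTorch build recent enough to provide torch.distributed.is_available() and torch.cuda.nccl.version(). It reports whether distributed support was compiled in at all, since is_nccl_available() can only be True when that is the case.

```python
import importlib.util


def nccl_diagnostics():
    """Collect facts about NCCL support in the installed PyTorch build.

    The function name and the dictionary keys are illustrative only.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch
    import torch.distributed as dist

    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    # False here means the build has no distributed support at all.
    info["distributed_available"] = dist.is_available()
    # Only query the NCCL backend when distributed support was compiled in.
    info["nccl_available"] = bool(
        info["distributed_available"] and dist.is_nccl_available()
    )
    try:
        # Raises on builds compiled without NCCL; treat that as "unknown".
        import torch.cuda.nccl as nccl

        info["nccl_version"] = nccl.version()
    except Exception:
        info["nccl_version"] = None
    return info


if __name__ == "__main__":
    for key, value in nccl_diagnostics().items():
        print(f"{key}: {value}")
```

If distributed_available is True but nccl_available is False, the build most likely did not pick up NCCL, even if the standalone C++ test binary passes.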

My full build log is at https://s3-us-west-2.amazonaws.com/deepingsource-temp-outgoing/build_log.txt

Thank you.

Oh, it seems to be an issue with the version I built, commit 166ee86b46721f6fd8f2c6ff4284787269fc36d1.
I downloaded commit 85d3fccee740bfa3493fab3f0bf7cea039e2c0bc and built again, and now it works well.
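
In case it helps anyone comparing builds, here is a minimal sketch for checking which source commit an installed PyTorch was built from. The helper name build_identity is hypothetical, and it assumes the build records its commit in torch.version.git_version, which may be missing on some builds (hence the getattr fallback).

```python
import importlib.util


def build_identity():
    """Return the version string and source commit of the installed PyTorch.

    Returns None when PyTorch is not installed. The helper name is
    illustrative only.
    """
    if importlib.util.find_spec("torch") is None:
        return None

    import torch
    import torch.version

    return {
        "version": str(torch.__version__),
        # Commit hash recorded at build time; None if the build lacks it.
        "git_version": getattr(torch.version, "git_version", None),
    }


if __name__ == "__main__":
    print(build_identity())
```

Comparing git_version against the hash you meant to build is a quick way to confirm that Python is importing the rebuilt package rather than an older install.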
Thank you.
