Distributed DP not working with PyTorch built from source

I’m running into the following error with DistributedDataParallel, using code very similar to the example here:

Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 476, in _reduction_thread_fn
    _process_batch()  # just to have a clear scope
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 460, in _process_batch
    nccl.reduce(dev_coalesced, root=0, streams=nccl_streams)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/cuda/nccl.py", line 51, in reduce
    torch._C._nccl_reduce(inputs, outputs, root, op, streams, comms)
RuntimeError: NCCL Error 2: system error

This is with PyTorch v0.4.0 built from source, CUDA 9.0, and cuDNN 7.0.5.
The exception does not occur when I use the PyTorch package installed from conda.
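For context, the setup looks roughly like the sketch below (the model, addresses, and world size are illustrative placeholders, not the exact code from the linked example); the NCCL backend is the path that raises the error in the traceback:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def setup_ddp(rank, world_size):
    # Initialize the process group with the NCCL backend
    # (the init method and port here are placeholders).
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",
        rank=rank,
        world_size=world_size,
    )
    # Wrap a model on this rank's GPU; gradients are reduced
    # across ranks via NCCL during backward.
    model = nn.Linear(10, 10).cuda(rank)
    return nn.parallel.DistributedDataParallel(model, device_ids=[rank])
```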

Any help much appreciated :slight_smile:

Update: this turned out to be an NCCL version issue — building against NCCL 2.1.15-1 works fine.
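For anyone hitting the same error, you can check which NCCL version your PyTorch build is using (the return type of `version()` varies across PyTorch releases, e.g. a packed int on older builds):

```python
import torch
import torch.cuda.nccl

# Report the NCCL version this PyTorch build links against.
# On a CPU-only build there is no NCCL, so guard on CUDA availability.
if torch.cuda.is_available():
    print(torch.cuda.nccl.version())
else:
    print("CUDA not available in this build")
```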