I’m running into the following error while using DistributedDataParallel, with code very similar to the code here:
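For reference, here is a minimal sketch of the kind of setup I'm using (the model, shapes, and hyperparameters are illustrative; this sketch uses the gloo backend on CPU so it runs in a single process, whereas my actual run uses NCCL on GPUs):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process process group for illustration; the real run launches
# one process per GPU and uses the "nccl" backend instead of "gloo".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stand-in for the actual model
ddp_model = DistributedDataParallel(model)

out = ddp_model(torch.randn(8, 4))
out.sum().backward()  # gradients are synchronized across ranks here

dist.destroy_process_group()
```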
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 476, in _reduction_thread_fn
    _process_batch()  # just to have a clear scope
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 460, in _process_batch
    nccl.reduce(dev_coalesced, root=0, streams=nccl_streams)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/cuda/nccl.py", line 51, in reduce
    torch._C._nccl_reduce(inputs, outputs, root, op, streams, comms)
RuntimeError: NCCL Error 2: system error
This is with PyTorch v0.4.0, built from source, with CUDA 9.0 and cuDNN 7.0.5.
I don't run into this exception when I use the prebuilt PyTorch available through Conda.
Any help would be much appreciated.