I’m running into the following error while using
DistributedDataParallel, with code very similar to the code here:
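For reference, the training pattern I'm using is roughly the following (a minimal sketch, not my actual script — the model, shapes, and single-process `world_size=1` setup are placeholders, and I've used the `gloo` backend here so it runs on CPU; my real run uses the NCCL path, which is where the error below is raised during the reduction):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group for illustration; the real job uses
# multiple processes and the "nccl" backend on GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)   # placeholder model
ddp_model = DDP(model)

out = ddp_model(torch.randn(3, 4))
loss = out.sum()
loss.backward()                 # gradients are reduced across ranks here

dist.destroy_process_group()
```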
Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 476, in _reduction_thread_fn
    _process_batch()  # just to have a clear scope
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 460, in _process_batch
    nccl.reduce(dev_coalesced, root=0, streams=nccl_streams)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/cuda/nccl.py", line 51, in reduce
    torch._C._nccl_reduce(inputs, outputs, root, op, streams, comms)
RuntimeError: NCCL Error 2: system error
This is with PyTorch v0.4.0, built from source. CUDA version is 9.0, cuDNN version is 7.0.5.
I don’t run into this exception when I use the PyTorch build available from Conda.
Any help much appreciated.