DataParallel raises NcclError

Hi, all,
I ran into an NcclError while running someone’s code, so I tried a simpler way to test the DataParallel function: I changed the official MNIST tutorial code into a DataParallel version like this:

if args.cuda:
    model = torch.nn.DataParallel(model).cuda()
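For reference, a minimal self-contained sketch of that change (the stand-in module below replaces the tutorial’s Net, and torch.cuda.is_available() stands in for args.cuda; both substitutions are just for illustration):

import torch
import torch.nn as nn

# Stand-in for the tutorial's Net; any nn.Module behaves the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.is_available():  # the tutorial gates this on args.cuda
    # DataParallel replicates the module onto every visible GPU and
    # scatters each batch across them; that replication step is where
    # the NCCL broadcast in the traceback below happens.
    model = torch.nn.DataParallel(model).cuda()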

Then it showed the same error I had met in that other code:
Traceback (most recent call last):
  File "tu_mnist.py", line 126, in <module>
    train(epoch)
  File "tu_mnist.py", line 90, in train
    output = model(data)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 65, in replicate
    return replicate(module, device_ids)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 18, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/comm.py", line 57, in broadcast_coalesced
    results = broadcast(_flatten_tensors(chunk), devices)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/comm.py", line 26, in broadcast
    nccl.broadcast(tensors)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/nccl.py", line 180, in broadcast
    comm = communicator(inputs)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/nccl.py", line 133, in communicator
    _communicators[key] = NcclCommList(devices)
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/nccl.py", line 106, in __init__
    check_error(lib.ncclCommInitAll(self, len(devices), int_array(devices)))
  File "/root/Util/miniconda/envs/pt/lib/python3.6/site-packages/torch/cuda/nccl.py", line 118, in check_error
    raise NcclError(status)
torch.cuda.nccl.NcclError: System Error (2)

I run my experiments in an nvidia-docker environment, with CUDA 7.5 on K80 GPUs. Does anybody know why?
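To narrow it down, the failing path can be exercised without DataParallel: the traceback dies inside torch.cuda.comm.broadcast, so calling it directly on a small tensor should reproduce the error. A diagnostic sketch (device ids 0 and 1 are assumed; adjust to your machine):

import torch
import torch.cuda.comm

# Broadcasting exercises the same NCCL communicator setup
# (ncclCommInitAll) that fails in the traceback above.
t = torch.randn(10).cuda(0)
copies = torch.cuda.comm.broadcast(t, [0, 1])  # raises NcclError if NCCL is broken
print([c.get_device() for c in copies])        # expect [0, 1]

If this fails too, the problem is in the NCCL/driver setup rather than in DataParallel itself.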

I got the same error. Did you solve it?

Not a clue. I suspect it’s because of Docker. Are you running in Docker too?

Hi @apaszke, could you venture a guess at the reason?

No, I’m using Anaconda. I updated PyTorch and the error went away.

However, my PyTorch still doesn’t work properly in parallel mode. It grabs GPU resources and runs without reporting any error, but it prints nothing to the terminal and doesn’t save the model either. All of this happens only when I use the DataParallel function.
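One way to get more visibility into a silent hang like this (a suggestion, assuming NCCL is involved) is to enable NCCL’s own logging before CUDA initializes:

import os

# Must be set before the first CUDA/NCCL call in the process;
# running `NCCL_DEBUG=INFO python train.py` from the shell works too.
os.environ["NCCL_DEBUG"] = "INFO"

import torch  # import only after the variable is set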

Any ideas?

Same problem here: running PyTorch in Docker and hitting the same issue. It was working fine on my computer, but then, possibly due to an auto-update of Ubuntu, my NVIDIA drivers got messed up. I reinstalled all my NVIDIA drivers; I had driver version 375 before and now I have 384. Since the reinstallation I’m unable to run PyTorch DataParallel. I’m running CUDA 8.0. This is definitely a Docker-related issue, because I can run locally just fine (albeit I’m running PyTorch 0.2.0_3 locally and 0.1.12_2 in the Docker image).

@tgaaly are you using nvidia-docker? If your base machine’s driver is different from the driver in the Docker image, regular docker doesn’t work; only nvidia-docker does.
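A quick sanity check to run inside the container (just a diagnostic sketch; torch.version.cuda is available in recent PyTorch builds):

import torch

print(torch.__version__)          # PyTorch build inside the container
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # False usually points at a driver mismatch
print(torch.cuda.device_count())  # should match what nvidia-smi reports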

The fix for me was to revert to NVIDIA display driver version 375.66. I am using nvidia-docker, so I’m not sure why this happened. I’ll have to take a look.

I got the same error, but I am using Anaconda. I solved it by removing the nccl package from conda with the command “conda remove nccl”.
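To confirm which NCCL PyTorch ends up using after that, recent builds expose the bundled version (this helper may not exist in older releases):

import torch

# Reports the NCCL version PyTorch loads; if the conda package was
# shadowing the bundled library before, this should now reflect the
# library that ships with PyTorch.
print(torch.cuda.nccl.version())

If the broadcast test from earlier in the thread now runs without an NcclError, the conflicting conda package was indeed the culprit.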