NCCL/CUDA error running on multiple GPUs: torch.cuda.nccl.NcclError: System Error (2)

adrianalbert · October 27, 2017, 4:18pm

Hi,

I'm trying to run several GAN architectures in PyTorch 0.2 inside a Docker container (started with nvidia-docker) on 4 NVIDIA K80 GPUs. The code runs fine on the CPU or on a single GPU, but it crashes as soon as I use multiple GPUs. I've seen this with both the DCGAN and the DiscoGAN architectures.

The error I get in each case (torch.cuda.nccl.NcclError: System Error (2)) is below. Any ideas?

Traceback (most recent call last):
  File "main.py", line 41, in <module>
    main(config)
  File "main.py", line 33, in main
    trainer.train()
  File "/home/nbserver/DiscoGAN-pytorch/trainer.py", line 193, in train
    x_AB = self.G_AB(x_A).detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 64, in replicate
    return replicate(module, device_ids)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 54, in broadcast_coalesced
    results = broadcast(_flatten_tensors(chunk), devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 24, in broadcast
    nccl.broadcast(tensors)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 182, in broadcast
    comm = communicator(inputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 133, in communicator
    _communicators[key] = NcclCommList(devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 106, in __init__
    check_error(lib.ncclCommInitAll(self, len(devices), int_array(devices)))
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 118, in check_error
    raise NcclError(status)
torch.cuda.nccl.NcclError: System Error (2)

richard · October 27, 2017, 5:53pm

I haven’t seen this one before, but here are a few suggestions:

Try this out and see if anything happens: https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158
Install PyTorch from source and see if that fixes the bug.
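
It may also help to check whether a bare DataParallel forward pass fails outside the GAN code, since the traceback dies while replicating parameters across GPUs rather than in the model itself. The script below is only a sketch and is not taken from the linked issue: the function name `multi_gpu_smoke_test` and the tiny `nn.Linear` model are made up for illustration, and it is written against the current PyTorch API (on 0.2 the input would presumably need wrapping in `torch.autograd.Variable`).

```python
# Minimal multi-GPU smoke test (illustrative sketch, not code from this thread).
try:
    import torch
    import torch.nn as nn
except ImportError:  # report cleanly when PyTorch is not installed
    torch = None


def multi_gpu_smoke_test():
    """Run one DataParallel forward pass and describe the outcome as a string."""
    if torch is None:
        return "skipped: PyTorch is not installed"
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return "skipped: fewer than two CUDA devices are visible"

    # Hypothetical toy model; any small module should do for this check.
    model = nn.DataParallel(nn.Linear(8, 4).cuda())
    batch = torch.randn(16, 8).cuda()
    try:
        out = model(batch)  # the forward pass replicates parameters to each GPU
    except Exception as exc:
        return "failed: {}: {}".format(type(exc).__name__, exc)
    return "ok: output shape {}".format(tuple(out.shape))


if __name__ == "__main__":
    print(multi_gpu_smoke_test())
```

If this fails with the same NcclError, the problem is probably in the driver/NCCL/container setup rather than in the DiscoGAN or DCGAN code.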

adrianalbert · October 27, 2017, 9:08pm

Thanks for the quick reply!

What ended up working for me was updating the NVIDIA driver to version 375.66 (NVIDIA-Linux-x86_64-375.66.run).
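
For anyone hitting the same error, here is a rough way to see which driver version a machine or container reports and how it compares with 375.66. This is a sketch under a few assumptions: it needs Python 3 (for `shutil.which`) and `nvidia-smi` on the PATH, the helper names are made up, and treating 375.66 as a lower bound is only a guess, since all this thread establishes is that this particular version worked.

```python
import shutil
import subprocess

# Version that resolved the error in this thread; using it as a threshold
# is an assumption, not something NVIDIA or PyTorch documents.
WORKING_DRIVER = "375.66"


def parse_version(text):
    """Turn a dotted version such as '375.66' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    lines = output.decode().strip().splitlines()
    return lines[0].strip() if lines else None  # one line per GPU; take the first


def is_older(installed, reference=WORKING_DRIVER):
    """True if the installed version sorts before the reference version."""
    return parse_version(installed) < parse_version(reference)


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not found or returned no driver version")
    elif is_older(version):
        print("driver {} is older than {}".format(version, WORKING_DRIVER))
    else:
        print("driver {} is {} or newer".format(version, WORKING_DRIVER))
```

Running it inside the container as well as on the host should show whether both see the same driver.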