Using multi-GPU in PyTorch

When I use multiple GPUs with nn.DataParallel, I get the following error:

Message: NCCL Error 2: system error
Traceback:
[1] File "/root/Workspace/ptNest/src/ptnest/libs/base/optimizer.py", line 180, in function "network_trainer"
process(state, 'train', train_loader)
[2] File "/root/Workspace/ptNest/src/ptnest/libs/base/optimizer.py", line 131, in function "process"
output_var = model(input_var)
[3] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in function "__call__"
result = self.forward(*input, **kwargs)
[4] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 72, in function "forward"
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
[5] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 77, in function "replicate"
return replicate(module, device_ids)
[6] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in function "replicate"
param_copies = Broadcast.apply(devices, *params)
[7] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 17, in function "forward"
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
[8] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/cuda/comm.py", line 63, in function "broadcast_coalesced"
results = broadcast(flat, devices)
[9] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/cuda/comm.py", line 26, in function "broadcast"
nccl.broadcast(tensors)
[10] File "/root/Util/miniconda/envs/py3.5/lib/python3.5/site-packages/torch/cuda/nccl.py", line 50, in function "broadcast"
torch._C._nccl_broadcast(inputs, root)
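For context, this is roughly the kind of wrapping I am doing. The traceback shows the error fires inside DataParallel's replicate step, where NCCL broadcasts the parameters to each GPU. The model, tensor shapes, and names below are placeholders (my real network code isn't shown), so treat this as a minimal sketch rather than my exact setup:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the real model.
model = nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module across visible GPUs on every
    # forward pass; the NCCL broadcast in the traceback happens here,
    # inside DataParallel.forward() -> replicate().
    model = nn.DataParallel(model)

if torch.cuda.is_available():
    model = model.cuda()

input_var = torch.randn(8, 16)  # placeholder batch
if torch.cuda.is_available():
    input_var = input_var.cuda()

output_var = model(input_var)
print(output_var.shape)  # torch.Size([8, 4])
```

On a machine without multiple GPUs this runs as a plain module call, so the NCCL path in the traceback is only exercised when more than one device is visible.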

Any idea why this problem occurs?

Thanks
