Multi-GPU running problem

I was trying to run code on 4 GPUs with IDs 4, 5, 6, 7, but I got the error below. When I run on GPUs 0, 1, 2, 3, everything works fine. Does anyone have an idea what the reason might be?

Traceback (most recent call last):
  File "", line 279, in <module>
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/autograd/", line 155, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/autograd/", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/", line 25, in backward
    return comm.reduce_add_coalesced(grad_outputs, self.input_device)
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/cuda/", line 122, in reduce_add_coalesced
    result = reduce_add(flattened, destination)
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/cuda/", line 92, in reduce_add
    nccl.reduce(inputs, outputs, root=destination)
  File "/home/didoyang/anaconda2/lib/python2.7/site-packages/torch/cuda/", line 161, in reduce
    assert(root >= 0 and root < len(inputs))
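My reading of the assertion above (not confirmed against the PyTorch source) is that the destination device ID is being validated as an index into the list of per-GPU input tensors. With GPUs 4–7 there are only four inputs, so a destination of 4 is out of range, while with GPUs 0–3 the IDs happen to coincide with valid list indices. A hypothetical illustration of that check:

```python
# Hypothetical sketch of the failing check in nccl.reduce:
# `root` is the destination device ID, but it is tested as an
# index into the list of per-GPU tensors.
def check_root(device_ids, root):
    inputs = list(device_ids)  # one input tensor per participating GPU
    return 0 <= root < len(inputs)

print(check_root([0, 1, 2, 3], root=0))  # True: IDs coincide with indices
print(check_root([4, 5, 6, 7], root=4))  # False: only 4 inputs, index 4 is out of range
```

This would explain why the same code runs on GPUs 0–3 but fails on GPUs 4–7, and why remapping the IDs (as in the reply below from the thread) avoids the problem.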

Is it something like your batch size being 4, so that on 8 GPUs there aren't enough examples to go around? (I'm not saying that's the cause, just checking this point.)

I had the same problem with the same error message.
It occurs when I use GPU2 and GPU3 of the system.
The batch size is 32, so it is certainly larger than the number of GPUs.

To solve this problem I set the CUDA_VISIBLE_DEVICES environment variable, so the two GPUs are remapped to logical IDs 0 and 1:
CUDA_VISIBLE_DEVICES=2,3 python3 --gpu_ids 0,1
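For context, here is a minimal sketch of how the remapped IDs would be used inside a script (the model and shapes are made up for illustration; the script falls back to CPU when no GPU is present):

```python
import os
# Setting this before importing torch remaps physical GPUs 2 and 3
# to logical IDs 0 and 1, so torch.cuda only sees those two devices.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")

import torch
import torch.nn as nn

# Toy model for illustration; any nn.Module works the same way.
model = nn.Linear(16, 4)

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    # device_ids are logical indices after the remapping, not the
    # physical GPU numbers 2 and 3.
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1])

x = torch.randn(32, 16)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)
print(tuple(out.shape))
```

Because the visible devices are renumbered starting from 0, every device index the code passes to DataParallel is a valid list index, which sidesteps the failing assertion in the traceback above.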
