Multi-GPU Training Different Memory Usage

Is it possible to set GPUs other than the first one (0) to use more memory?

Try changing the order using CUDA_VISIBLE_DEVICES=1,0 python script.py args.
This should swap the two devices, so that data will be accumulated on GPU1.
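To see why swapping works, here is a minimal sketch (plain Python, no GPU needed; the helper name is hypothetical) of how CUDA_VISIBLE_DEVICES remaps physical GPU ids to the logical ids a process sees:

```python
# CUDA_VISIBLE_DEVICES reorders which physical GPUs a process sees and in
# what order; logical device 0 (cuda:0) is always the FIRST id in the list.
def logical_to_physical(visible_devices):
    """Hypothetical helper: map logical device index -> physical GPU id."""
    return {logical: int(phys)
            for logical, phys in enumerate(visible_devices.split(","))}

mapping = logical_to_physical("1,0")
print(mapping)  # {0: 1, 1: 0}
# DataParallel accumulates on logical device 0 by default, which is now
# physical GPU 1 -- so the extra memory usage moves to the second card.
```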

Thank you very much for the reply. Is this equivalent to the following?
model = nn.DataParallel(model, device_ids=[1, 0])

I think you are right! The first device_id will be used as the output_device.
Also, I think you could just set output_device to the id you want to accumulate your updates on.
Here are the important lines of code.

Great! Thanks a lot!

I did the following for PyTorch 0.4.1:
model = nn.DataParallel(model, device_ids=[3, 0, 1, 2])

And got:
torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

Does it work if you pass device_ids=[0, 3, 1, 2], or output_device=3?
I don’t have multiple GPUs currently, otherwise I would test it quickly.
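For what it's worth, the error seems to come from DataParallel broadcasting the model's parameters from device_ids[0]: model.cuda() leaves the parameters on GPU 0, so device_ids=[3, 0, 1, 2] fails unless the model is first moved to cuda:3. A pure-Python sketch of the constraint (the checker function is hypothetical, just mimicking the failing check; no GPU needed):

```python
# Sketch of the constraint behind "all tensors must be on devices[0]":
# nn.DataParallel broadcasts the model's parameters from device_ids[0],
# so the parameters must already live on that device before wrapping.
def check_dataparallel_placement(param_device, device_ids):
    """Hypothetical stand-in for the check that raises
    'RuntimeError: all tensors must be on devices[0]'."""
    if param_device != device_ids[0]:
        raise RuntimeError("all tensors must be on devices[0]")
    return True

# model.cuda() puts parameters on GPU 0, so device_ids=[3, 0, 1, 2] fails:
try:
    check_dataparallel_placement(0, [3, 0, 1, 2])
except RuntimeError as e:
    print(e)  # all tensors must be on devices[0]

# Moving the model first (model.to('cuda:3')) would satisfy the check:
print(check_dataparallel_placement(3, [3, 0, 1, 2]))  # True
```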

On a 4 x 1080Ti cluster:
I passed device_ids=[3, 0, 1, 2] without setting output_device=3; it didn’t work. I got this error:
torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

However, on my own 2 x Titan Xp Linux box:
When I set CUDA_VISIBLE_DEVICES=1,0, with or without passing device_ids=[0, 1], it worked: the second Titan Xp used more memory.

When I only passed device_ids=[1, 0] without setting CUDA_VISIBLE_DEVICES, I got this error:
torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

I passed device_ids=[0, 1] and output_device=1, and I got the following error:
torch/nn/functional.py", line 1407, in nll_loss
return torch._C._nn.nll_loss(input, target, weight, Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:29
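If it helps: with output_device=1, DataParallel gathers the outputs on cuda:1, but the loss target presumably stayed on cuda:0, which would trip exactly this assertion. The likely fix is moving the target to the output device, e.g. target = target.to('cuda:1'), before computing the loss. A sketch of the failing check (hypothetical function, no GPU needed):

```python
# Mimics the checkGPU assertion in ClassNLLCriterion: every tensor feeding
# the loss (input, target, weight, ...) must sit on the same device.
def check_loss_devices(output_device, target_device):
    """Hypothetical stand-in for the THCTensor_(checkGPU) assertion."""
    if output_device != target_device:
        raise RuntimeError(
            "Some of weight/gradient/input tensors are located on "
            "different GPUs. Please move them to a single one."
        )
    return True

# Output gathered on cuda:1, target left on cuda:0 -> the assertion fires:
try:
    check_loss_devices(1, 0)
except RuntimeError as e:
    print(e)

# After target = target.to('cuda:1'), both devices match:
print(check_loss_devices(1, 1))  # True
```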