Can the input be assigned to a GPU other than GPU 0?

I’m trying to train models with 4 GPUs (1080 Ti) with a batch size of 128, but I get an out-of-memory error.
GPU 0 is actually out of memory, while the other 3 GPUs each have about 7 GB free out of 12 GB, which is plenty.
The main reason is that all inputs have to be on GPU 0 first, and are only replicated to the other GPUs later.
It would be much more efficient if the inputs could be distributed in parallel. Is this possible?

As far as I know it’s not possible. What you can do to optimize memory usage is the following:
the output and the ground truth must be on the same GPU.

When you call DataParallel you can allocate the output and the ground truth to cuda:1 and the input to cuda:2.

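A minimal sketch of this setup (the toy module and tensor sizes are made up for illustration, and the snippet falls back to the plain CPU model when fewer than 4 GPUs are present):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)  # toy model, stands in for the real network

if torch.cuda.device_count() >= 4:
    # Parameters stay on cuda:0 (the first entry of device_ids), but the
    # per-GPU outputs are gathered onto cuda:1 via output_device.
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3], output_device=1).cuda(0)
```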

Here you are calling the model on GPU 0, but the output lands on GPU 1.

Now, when you allocate the input and the ground truth:

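Putting the pieces together as a self-contained sketch (again a toy model and fake data, guarded so it also runs on CPU; the device indices follow the description in this answer: model on GPU 0, output and ground truth on GPU 1, input on GPU 2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(512, 10)          # toy model
batch = torch.randn(128, 512)       # fake input batch
gt = torch.randint(0, 10, (128,))   # fake ground truth

if torch.cuda.device_count() >= 4:
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3], output_device=1).cuda(0)
    batch = batch.cuda(2)   # input on a third GPU
    gt = gt.cuda(1)         # ground truth on the same GPU as the gathered output

output = model(batch)               # on cuda:1 when DataParallel is active
loss = F.cross_entropy(output, gt)  # output and gt are on the same device
```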

So the output and the ground truth are on the same GPU, and the input is on a third GPU.