CUDA out of memory

When I train on one GPU, I get a 'CUDA out of memory' error.
So I added a second GPU with nn.DataParallel(model), but the out-of-memory error still occurs.
I searched Google for this and found the output_device setting, like:
model = nn.DataParallel(model, output_device=1)
model2 = nn.DataParallel(model2, output_device=1)

But now I get a runtime error.
How can I solve this, other than decreasing the batch size?

RuntimeError: Assertion `THCTensor_(checkGPU)(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:28

This error happens because you are giving NLLLoss inputs and targets that live on different devices. They all need to be on the same GPU.
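
A minimal sketch of the fix, assuming a dummy model and random data in place of yours (the layer sizes, batch size, and class count are placeholders): move the targets onto the same device as the gathered outputs before calling the loss.

import torch
import torch.nn as nn

# Hypothetical classifier standing in for the real model.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5), nn.LogSoftmax(dim=1))
model = nn.DataParallel(model.cuda(), output_device=1)
criterion = nn.NLLLoss()

inputs = torch.randn(8, 10).cuda()          # .cuda() defaults to cuda:0
targets = torch.randint(0, 5, (8,)).cuda()  # also lands on cuda:0

outputs = model(inputs)                # gathered on cuda:1 (output_device=1)
targets = targets.to(outputs.device)   # move targets to the same GPU as outputs
loss = criterion(outputs, targets)
loss.backward()

With output_device=1 the outputs are gathered on cuda:1, while tensors sent to the GPU with plain .cuda() default to cuda:0, which is exactly the mismatch the assertion reports.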