I have a model which contains two parts. The first part, "model1", takes one image and outputs a feature 'model1_feat'. The second part, "model2", takes 'model1_feat' and another feature 'input_feat' as input, and generates the final output. I want to train this model on multiple GPUs. I have written the following code:
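For context, here is a minimal sketch of how the two parts described above might be defined; the layer choices and feature sizes are placeholders, not the actual model:

import torch
import torch.nn as nn

class Model1(nn.Module):
    # takes an image and produces model1_feat (layers are placeholders)
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, image):
        return self.backbone(image).flatten(1)   # model1_feat, shape (N, 64)

class Model2(nn.Module):
    # takes model1_feat and input_feat and produces the final output
    def __init__(self, feat_dim=64, input_feat_dim=32, out_dim=10):
        super().__init__()
        self.head = nn.Linear(feat_dim + input_feat_dim, out_dim)
    def forward(self, model1_feat, input_feat):
        return self.head(torch.cat([model1_feat, input_feat], dim=1))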
Your model is like a conditional GAN; I'm also doing some experiments like yours.
I think you should put both models on multiple GPUs first, and in the training procedure feed model1_feat and input_feat to model2, like this:
import torch.nn as nn

model1 = nn.DataParallel(model1).cuda()
model2 = nn.DataParallel(model2).cuda()
# in the training procedure
model1_feat = model1(input_image)
output = model2(model1_feat, input_feat)  # final output
You can select the GPUs on the command line, e.g. CUDA_VISIBLE_DEVICES=0,1.
As far as I know, you cannot pass tensors between different GPUs during the forward pass.
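A fuller training step under this setup might look like the sketch below; the optimizer, criterion, and the 'loader' variable are assumptions, not details from this thread:

import torch
import torch.nn as nn

model1 = nn.DataParallel(model1).cuda()
model2 = nn.DataParallel(model2).cuda()
optimizer = torch.optim.Adam(
    list(model1.parameters()) + list(model2.parameters()), lr=1e-4)  # assumed optimizer
criterion = nn.MSELoss()  # assumed loss

for input_image, input_feat, target in loader:  # 'loader' is a placeholder DataLoader
    input_image, input_feat, target = input_image.cuda(), input_feat.cuda(), target.cuda()
    model1_feat = model1(input_image)         # scattered across GPUs, gathered back on GPU 0
    output = model2(model1_feat, input_feat)  # final output
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()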
Thanks for your reply. I think you are right. But the problem is that the GPU memory usage is extremely unbalanced. The first GPU consumes a lot of memory while the others use only a little. For example:
| 0 22043 C /usr/bin/python 11138MiB |
| 1 22043 C /usr/bin/python 5724MiB |
| 2 22043 C /usr/bin/python 5548MiB |
| 3 22043 C /usr/bin/python 5613MiB |
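That imbalance is expected with nn.DataParallel: inputs are scattered across the GPUs, but the gathered outputs, the loss, and the backward buffers all sit on the first device. A common mitigation, sketched here as a suggestion rather than something from this thread, is to compute the loss inside the wrapped module so that only small per-replica loss values are gathered back to GPU 0 (the criterion below is an assumption):

import torch.nn as nn

class Model2WithLoss(nn.Module):
    # wraps the plain (unwrapped) model2 and the criterion so the loss is
    # computed on each replica's own GPU
    def __init__(self, model2, criterion):
        super().__init__()
        self.model2 = model2
        self.criterion = criterion
    def forward(self, model1_feat, input_feat, target):
        output = self.model2(model1_feat, input_feat)
        return self.criterion(output, target)  # the large 'output' never leaves its GPU

model2 = nn.DataParallel(Model2WithLoss(model2, nn.MSELoss())).cuda()
# in the training procedure
loss = model2(model1_feat, input_feat, target).mean()  # mean over the per-GPU losses
loss.backward()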
Hi, but when I try this I get an error in my loss function; maybe the targets remain on GPU 1 while the model outputs are on GPU 0.
This is the error I get in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code but one that only appears after enabling data parallelism (I ran my less intensive code both with and without data parallelism, and it throws the same error only when data parallelism is used).
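If the suspicion is that targets and outputs end up on different devices (or with different sizes after scattering), a quick sanity check right before the failing line might look like this; 'output' and 'target' stand for the tensors the loss function receives, and 'buffer' for whatever tensor is being masked:

# hypothetical check before the failing masked assignment
target = target.to(output.device)   # keep target on the same GPU as the output
assert buffer.shape == target.shape, (buffer.shape, target.shape)
buffer[torch.eq(target, -1.)] = 0   # the original masked assignment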
My model is memory-intensive and I have 2 GPUs with 12206 MiB each. I just need to split my model across both GPUs during training as well as testing.
By the way, my model is an FCN and its batch size is 1.
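With a batch size of 1, nn.DataParallel (which splits the batch across GPUs) will not help; splitting the model itself across the two GPUs is model parallelism. A minimal sketch, with the split point and submodule names as assumptions:

import torch.nn as nn

class SplitFCN(nn.Module):
    # places the first half of the network on cuda:0 and the second half on cuda:1
    def __init__(self, part1, part2):
        super().__init__()
        self.part1 = part1.to('cuda:0')
        self.part2 = part2.to('cuda:1')
    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))   # move the intermediate activation to the second GPU

# the loss must then be computed against targets on the final output's device:
# loss = criterion(split_fcn(image), target.to('cuda:1'))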