I have a model which contains two parts. The first part, "model1", takes one image and outputs a feature 'model1_feat'. The second part, "model2", takes 'model1_feat' and another feature 'input_feat' as input, and generates the final output. I want to train this model on multiple GPUs. I have written the following code:
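For context, here is a minimal sketch of how the two parts described above might be defined; the layer choices and feature sizes are placeholders, not the actual model:

import torch
import torch.nn as nn

class Model1(nn.Module):
    # takes an image and produces model1_feat (layers are placeholders)
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, image):
        return self.backbone(image).flatten(1)   # model1_feat, shape (N, 64)

class Model2(nn.Module):
    # takes model1_feat and input_feat and produces the final output
    def __init__(self, feat_dim=64, input_feat_dim=32, out_dim=10):
        super().__init__()
        self.head = nn.Linear(feat_dim + input_feat_dim, out_dim)
    def forward(self, model1_feat, input_feat):
        return self.head(torch.cat([model1_feat, input_feat], dim=1))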
Your model is like a conditional GAN; I'm also doing some experiments like yours.
I think you should put both models on multiple GPUs first, and in the training procedure feed model1_feat and input_feat to model2, like this:
import torch.nn as nn

model1 = nn.DataParallel(model1).cuda()
model2 = nn.DataParallel(model2).cuda()
# in the training procedure
model1_feat = model1(input_image)
output = model2(model1_feat, input_feat)  # final output
You can select the GPUs on the command line, e.g. CUDA_VISIBLE_DEVICES=0,1.
As far as I know, you cannot pass tensors between different GPUs during the forward pass.
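A fuller training step under this setup might look like the sketch below; the optimizer, criterion, and the 'loader' variable are assumptions, not details from this thread:

import torch
import torch.nn as nn

model1 = nn.DataParallel(model1).cuda()
model2 = nn.DataParallel(model2).cuda()
optimizer = torch.optim.Adam(
    list(model1.parameters()) + list(model2.parameters()), lr=1e-4)  # assumed optimizer
criterion = nn.MSELoss()  # assumed loss

for input_image, input_feat, target in loader:  # 'loader' is a placeholder DataLoader
    input_image, input_feat, target = input_image.cuda(), input_feat.cuda(), target.cuda()
    model1_feat = model1(input_image)         # scattered across GPUs, gathered back on GPU 0
    output = model2(model1_feat, input_feat)  # final output
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()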
Thanks for your reply. I think you are right. But the problem is that the GPU memory usage is extremely unbalanced. The first GPU consumes a lot of memory while the others use only a little. For example:
| 0 22043 C /usr/bin/python 11138MiB |
| 1 22043 C /usr/bin/python 5724MiB |
| 2 22043 C /usr/bin/python 5548MiB |
| 3 22043 C /usr/bin/python 5613MiB |
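That imbalance is expected with nn.DataParallel: inputs are scattered across the GPUs, but the gathered outputs, the loss, and the backward buffers all sit on the first device. A common mitigation, sketched here as a suggestion rather than something from this thread, is to compute the loss inside the wrapped module so that only small per-replica loss values are gathered back to GPU 0 (the criterion below is an assumption):

import torch.nn as nn

class Model2WithLoss(nn.Module):
    # wraps the plain (unwrapped) model2 and the criterion so the loss is
    # computed on each replica's own GPU
    def __init__(self, model2, criterion):
        super().__init__()
        self.model2 = model2
        self.criterion = criterion
    def forward(self, model1_feat, input_feat, target):
        output = self.model2(model1_feat, input_feat)
        return self.criterion(output, target)  # the large 'output' never leaves its GPU

model2 = nn.DataParallel(Model2WithLoss(model2, nn.MSELoss())).cuda()
# in the training procedure
loss = model2(model1_feat, input_feat, target).mean()  # mean over the per-GPU losses
loss.backward()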
Hi, but when I try this I get an error in my loss function; maybe the targets remain on GPU 1 while the model outputs are on GPU 0.
This is the error I get in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code but one that only appears after enabling data parallelism (I ran my less intensive code both with and without data parallelism, and it throws the same error only when data parallelism is used).
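If the suspicion is that targets and outputs end up on different devices (or with different sizes after scattering), a quick sanity check right before the failing line might look like this; 'output' and 'target' stand for the tensors the loss function receives, and 'buffer' for whatever tensor is being masked:

# hypothetical check before the failing masked assignment
target = target.to(output.device)   # keep target on the same GPU as the output
assert buffer.shape == target.shape, (buffer.shape, target.shape)
buffer[torch.eq(target, -1.)] = 0   # the original masked assignment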
My model is memory-intensive and I have 2 GPUs with 12206 MiB each. I just need to split my model across both GPUs during training as well as testing.
By the way, my model is an FCN and its batch size is 1.
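With a batch size of 1, nn.DataParallel (which splits the batch across GPUs) will not help; splitting the model itself across the two GPUs is model parallelism. A minimal sketch, with the split point and submodule names as assumptions:

import torch.nn as nn

class SplitFCN(nn.Module):
    # places the first half of the network on cuda:0 and the second half on cuda:1
    def __init__(self, part1, part2):
        super().__init__()
        self.part1 = part1.to('cuda:0')
        self.part2 = part2.to('cuda:1')
    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))   # move the intermediate activation to the second GPU

# the loss must then be computed against targets on the final output's device:
# loss = criterion(split_fcn(image), target.to('cuda:1'))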